Structured Data in Cells (JSON Records and Arrays)

thadguidry · December 28, 2024, 9:03am

I think eventually I'd like to see OpenRefine Expression editor & Templating Exporter have the ability to use JSONata and being able to transform JSON data and structured data being stored in cells (so 2 new datatypes: JSON, and Array) as we've proposed before in

github.com/OpenRefine/OpenRefine

Spread array of values generated by an expression to form a record

opened 05:04PM - 30 Apr 22 UTC

wetneb

Type: Feature Request logic records

When we use an expression which returns an array in the `Transform` or `Add colu…mn based on this column` dialogs, the resulting cells are currently left blank because we cannot store an array in a single cell. This is of course not ideal and quite confusing because there is no explanation or warning about this (#1475, #1088). ### Proposed solution This should behave like the `Add column from reconciled values` operation: spread multiple values in consecutive cells to form a record. For instance, if I start with a table such as: | ID | JSON | |----|--------------------------| | 1 | {"letters":["a","b"]} | | 2 | {"letters":["c","d","e"]} | If I add a column `letters` generated by the expression `value.parseJson().letters`, I want to get the following result: | ID | JSON | letters | |----|--------------------------|---------| | 1 | {"letters":["a","b"]} | a | | | | b | | 2 | {"letters":["c","d","e"]} | c | | | | d | | | | e | This solution could be implemented in 3.x, without being an important breaking change (since expressions which return arrays are currently useless, we can assume that it is unlikely anyone relies on the current behaviour in a workflow). ### Alternatives considered #### Just keep the logic as it is but improve error reporting We could simply make it clearer at the preview stage that those values will not be stored. @thadguidry [proposed](https://groups.google.com/g/openrefine/c/aDxquAI4MqY/m/P-KD7mghFAAJ) the following UI: ![image](https://user-images.githubusercontent.com/309908/166115035-ce927a11-ac1f-45a2-8ca1-357ce1180eba.png) @ostephens has noted that it would be better not to render those as errors (because the expression does not return an error itself) but rather to signal this in a different way (for instance by improving the way we distinguish between data types of returned values in this preview area). #### Spread the values as columns In issue #36, it is proposed that values are spread on columns instead. With the example above, you would get: | ID | JSON | letters 1 | letters 2 | letters 3 | |----|--------------------------|----------|----------|----------| | 1 | {"values":["a","b"]} | a | b | | | 2 | {"values":["c","d","e"]} | c | d | e | I find that less helpful because storing an array like this makes it hard to reuse the results afterwards. The lists can have variable lengths, so the columns created by the operation depend on the data. And this behaviour can already be obtained by successively adding columns using the expression `value.parseJson().letters[0]`, `value.parseJson().letters[1]` and so on (or rather in the reverse order to get the exact result above). #### Store arrays as strings in cells @thadguidry also [proposed](https://github.com/OpenRefine/OpenRefine/issues/1088#issuecomment-365633202) to convert these values to strings before storing them, for instance by serializing them to JSON with `jsonize()`. That is indeed a sensible workaround that users can resort to at the moment, but it does not feel so convenient to reuse afterwards (although it is not as bad as when values are spread as columns in my opinion). #### Make it possible for cells to store arrays natively In PR #1586 @ostephens proposed to make it possible to store arrays in cells natively. That is definitely an interesting approach, which could potentially even remove the need for the records mode entirely: just store lists of values in the cells instead of spreading them on consecutive rows. Hence it seems to be a much bigger project to me. What I find daunting about it is that: * if we start allowing arrays in cells, we basically need to rethink most operations and decide how they should handle arrays. What happens when you reconcile a cell which contains an array? What happens when one of the transpose operations encounters that? And so on. * unfortunately, as explained [in a comment on #1586](https://github.com/OpenRefine/OpenRefine/pull/1586#issuecomment-420965752), we currently allow any Java `Object` to be stored as cell value, which means that we cannot rely on the compiler to make sure we are handling this new possibility in all places of the code where we should add this handling. This is also a problem for extension developers, who will not become aware of this unless they read our release notes. That being said I would be happy to consider this change in 4.0, where we already change to a very different data model - such a change could be introduced there. That does not make the design effort less daunting of course (it is still basically the same problem as redesigning the records mode). #### Give the user a choice between various of those options The dialogs where those expressions are previewed could be adapted to give the user a choice in how they want their arrays to be stored. Those expression preview dialogs are already quite full, but if we think there can be multiple sensible things to do there, it is natural to give the user a choice. But it is not really clear to me that it is worth it, because: * spreading the values in columns could be done using the `Split columns` operation instead (by adapting it to accept an expression returning an array), as suggested by @dfhuynh and others * storing the arrays as strings can already be done by adding `.jsonize()` at the end of the expression. Arguably this is not as discoverable as an explicit option in the UI to do it for you, but I would expect people who want this to be aware of `parseJson()` at least (otherwise how do they intend to reuse the results?) so it is also quite likely that they know of `jsonize()` * storing arrays in cells is not something we currently support, so we cannot give the option to do that yet ### Additional context Related issues I am aware of: #36, #1475, #1088. This is important in the context of the Structured Data on Commons integration project because the [CommonsExtension](https://github.com/OpenRefine/CommonsExtension) offers GREL functions which return arrays.

I can imagine where the Expression dialog allows an interactive experience with JSONata, as well as the Templating Exporter could have a new option for

template using JSONata?

JSONata (or JSONata-like) is the kind of experience that I think most of our OpenRefine users would expect us to deliver eventually. It's done by Andrew Coleman and others from IBM where records in DevOps, OpenShift, and general IaaS data in JSON format is quite prevalently used and often needs to be transformed, summarized, queried, etc.

Their 5 min video is so convincing:

Topic		Replies	Views
Representing hierarchical data: beyond the records mode? Design proposals	9	836	February 4, 2024
Blank result for a long GREL expression parsing XML data Support and Helpdesk	3	260	November 26, 2023
Preserve record mode hierarchy for export / templating Support and Helpdesk	4	69	December 30, 2024
GREL function previews ok, then fails Data cleaning and transformations	4	383	April 27, 2023
Help with splitting a json array Support and Helpdesk	4	267	May 14, 2024

Structured Data in Cells (JSON Records and Arrays)

Related topics