I think eventually I'd like to see OpenRefine Expression editor & Templating Exporter have the ability to use JSONata and being able to transform JSON data and structured data being stored in cells (so 2 new datatypes: JSON, and Array) as we've proposed before in
opened 05:04PM - 30 Apr 22 UTC
Type: Feature Request
logic
records
When we use an expression which returns an array in the `Transform` or `Add colu… mn based on this column` dialogs, the resulting cells are currently left blank because we cannot store an array in a single cell.
This is of course not ideal and quite confusing because there is no explanation or warning about this (#1475, #1088).
### Proposed solution
This should behave like the `Add column from reconciled values` operation: spread multiple values in consecutive cells to form a record.
For instance, if I start with a table such as:
| ID | JSON |
|----|--------------------------|
| 1 | {"letters":["a","b"]} |
| 2 | {"letters":["c","d","e"]} |
If I add a column `letters` generated by the expression `value.parseJson().letters`, I want to get the following result:
| ID | JSON | letters |
|----|--------------------------|---------|
| 1 | {"letters":["a","b"]} | a |
| | | b |
| 2 | {"letters":["c","d","e"]} | c |
| | | d |
| | | e |
This solution could be implemented in 3.x, without being an important breaking change (since expressions which return arrays are currently useless, we can assume that it is unlikely anyone relies on the current behaviour in a workflow).
### Alternatives considered
#### Just keep the logic as it is but improve error reporting
We could simply make it clearer at the preview stage that those values will not be stored. @thadguidry [proposed](https://groups.google.com/g/openrefine/c/aDxquAI4MqY/m/P-KD7mghFAAJ) the following UI:
![image](https://user-images.githubusercontent.com/309908/166115035-ce927a11-ac1f-45a2-8ca1-357ce1180eba.png)
@ostephens has noted that it would be better not to render those as errors (because the expression does not return an error itself) but rather to signal this in a different way (for instance by improving the way we distinguish between data types of returned values in this preview area).
#### Spread the values as columns
In issue #36, it is proposed that values are spread on columns instead. With the example above, you would get:
| ID | JSON | letters 1 | letters 2 | letters 3 |
|----|--------------------------|----------|----------|----------|
| 1 | {"values":["a","b"]} | a | b | |
| 2 | {"values":["c","d","e"]} | c | d | e |
I find that less helpful because storing an array like this makes it hard to reuse the results afterwards.
The lists can have variable lengths, so the columns created by the operation depend on the data.
And this behaviour can already be obtained by successively adding columns using the expression `value.parseJson().letters[0]`, `value.parseJson().letters[1]` and so on (or rather in the reverse order to get the exact result above).
#### Store arrays as strings in cells
@thadguidry also [proposed](https://github.com/OpenRefine/OpenRefine/issues/1088#issuecomment-365633202) to convert these values to strings before storing them, for instance by serializing them to JSON with `jsonize()`.
That is indeed a sensible workaround that users can resort to at the moment, but it does not feel so convenient to reuse afterwards (although it is not as bad as when values are spread as columns in my opinion).
#### Make it possible for cells to store arrays natively
In PR #1586 @ostephens proposed to make it possible to store arrays in cells natively.
That is definitely an interesting approach, which could potentially even remove the need for the records mode entirely: just store lists of values in the cells instead of spreading them on consecutive rows. Hence it seems to be a much bigger project to me. What I find daunting about it is that:
* if we start allowing arrays in cells, we basically need to rethink most operations and decide how they should handle arrays. What happens when you reconcile a cell which contains an array? What happens when one of the transpose operations encounters that? And so on.
* unfortunately, as explained [in a comment on #1586](https://github.com/OpenRefine/OpenRefine/pull/1586#issuecomment-420965752), we currently allow any Java `Object` to be stored as cell value, which means that we cannot rely on the compiler to make sure we are handling this new possibility in all places of the code where we should add this handling. This is also a problem for extension developers, who will not become aware of this unless they read our release notes.
That being said I would be happy to consider this change in 4.0, where we already change to a very different data model - such a change could be introduced there. That does not make the design effort less daunting of course (it is still basically the same problem as redesigning the records mode).
#### Give the user a choice between various of those options
The dialogs where those expressions are previewed could be adapted to give the user a choice in how they want their arrays to be stored. Those expression preview dialogs are already quite full, but if we think there can be multiple sensible things to do there, it is natural to give the user a choice. But it is not really clear to me that it is worth it, because:
* spreading the values in columns could be done using the `Split columns` operation instead (by adapting it to accept an expression returning an array), as suggested by @dfhuynh and others
* storing the arrays as strings can already be done by adding `.jsonize()` at the end of the expression. Arguably this is not as discoverable as an explicit option in the UI to do it for you, but I would expect people who want this to be aware of `parseJson()` at least (otherwise how do they intend to reuse the results?) so it is also quite likely that they know of `jsonize()`
* storing arrays in cells is not something we currently support, so we cannot give the option to do that yet
### Additional context
Related issues I am aware of: #36, #1475, #1088.
This is important in the context of the Structured Data on Commons integration project because the [CommonsExtension](https://github.com/OpenRefine/CommonsExtension) offers GREL functions which return arrays.
I can imagine where the Expression dialog allows an interactive experience with JSONata, as well as the Templating Exporter could have a new option for
JSONata (or JSONata-like) is the kind of experience that I think most of our OpenRefine users would expect us to deliver eventually. It's done by Andrew Coleman and others from IBM where records in DevOps, OpenShift, and general IaaS data in JSON format is quite prevalently used and often needs to be transformed, summarized, queried, etc.
Their 5 min video is so convincing: