I am opening this thread to give a summary of an important open problem in OpenRefine’s design.
The goal is to encourage people to come up with alternative design proposals or pointers to solutions from other tools.
GitHub issues relevant to this are normally classified under the “records” tag. Here are a few places where such a discussion has happened before:
- #2825
- The future of the records mode
- #2298, which proposes to investigate the Dremel encoding: “a novel columnar storage format for nested data, with algorithms for dissecting nested records into columns and reassembling them”
What is the records mode
OpenRefine projects are tables, or grids: they have a certain number of rows and columns, and a cell for each row/column combination.
This tabular representation works well in many cases. For instance, when working on a list of people: each person has a single date of birth, a single place of birth, a single height, so you can have each of those attributes in their own column, and have one row per person.
However, we often need to represent attributes which are not single-valued. For instance, a given person can have multiple hobbies or multiple email addresses. So it gets less convenient to represent them in a table.
OpenRefine’s records mode provides a workaround for such cases. The idea is to add additional lines such that each value can be stored in its own cell:
Person | Date of birth | Email address | Hobbies |
---|---|---|---|
Amanda | 1978-07-03 | amanda@brown.com | birdwatching |
amanda1787@hootmail.com | photography | ||
running | |||
Peter | 1986-11-23 | peter.super@gimail.com | singing |
peter@southwest.us | |||
peter8976@hautmail.com |
In this example, rows one to three form a record, and rows four to six another one. The separation between records depends on the first column: every row with a non-blank value in the first column defines a new record.
OpenRefine’s UI offers a switch between the rows and records mode. In rows mode, the records structure is ignored: all rows are treated independently of each other. In records mode however, most operations are applied for each record, making it for instance possible to manipulate all hobbies of a given person within the same GREL expression.
What are the problems with the records mode
Lack of support for hierarchical data of depth greater than one
Often, it is not enough to be able to have multi-valued attributes for the entities you are representing: those multiple values might have a structure of their own as well.
For example (following the user manual), your key column may be a film or television show, with multiple cast members identified by name, associated to that work. You may have one or more roles listed for each person. The roles are linked to the actors, which are linked to the title.
You can represent this in the spirit of the records mode, as follows:
Work | Actor | Role |
---|---|---|
The Wizard of Oz | Judy Garland | Dorothy Gale |
Ray Bolger | “Hunk” | |
The Scarecrow | ||
Jack Haley | “Hickory” | |
The Tin Man | ||
Bert Lahr | “Zeke” | |
The Cowardly Lion | ||
Frank Morgan | Professor Marvel | |
The Gatekeeper | ||
The Carriage Driver | ||
The Guard | ||
The Wizard of Oz | ||
Margaret Hamilton | Miss Almira Gulch | |
The Wicked Witch of the West |
The problem with this representation is that OpenRefine is not able to detect that each role is associated to a particular actor, and not to the containing work itself. The column groups feature was intended to add some column metadata that makes it possible to represent such a fact, but this feature has never been implemented fully and is not usable as such.
The JSON and XML importers use the structure above to coerce their hierarchical data formats into an OpenRefine table.
However, even with column groups, this conversion is lossy, meaning that we cannot export back the data into the original JSON or XML document (#1897).
This is a frequently requested feature.
Unclear differences between operations in rows and records mode and overall confusing user experience
When running an operation, it is not always clear what difference it makes whether we run it in rows or records mode.
Also, OpenRefine tends to switch automatically to the records mode (for instance when creating a project where the first column contains some blank values), and this can be confusing, even for users who know how the records mode works.
Overall, the records mode is far from intuitive and requires quite some talent for trainers to explain how it works.
Alternatives
Storing multiple values in a cell
You can combine those values in a single cell, by joining them with some separator.
OpenRefine offers operation to switch between the records representation and such a joined representation.
Working only in the joined representation is currently difficult: for instance, if you wanted to trim the whitespace around the hobbies in your dataset, and those hobbies were represented joined values in single
cells (such as knitting # ice-skating# playing board games
if using #
as a separator), you would have to use a more elaborate GREL function than value.trim()
, as this would only remove the leading whitespace before the first hobby and the trailing whitespace after the last hobby.
Similarly, if you want to reconcile each value, you cannot do so while they are stored in a single cell, since one cell can be reconciled to at most one entity at a time.
Another downside of this joined representation is that it requires the user to pick a separator which is guaranteed not to appear in cell values, otherwise they would be effectively splitting existing values into smaller parts.
Also, it does not really address the need to represent deeper hierarchical structures.
Better support for JSON/XML values in a cell
One could extend cells to be able to store not just basic datatypes (string, date, number, boolean…) but also hierarchical data, such as JSON or XML objects.
This is currently already possible, by keeping those hierarchical values as strings (and parsing them on the go with parseJson()
or parseXml()
as required). By adding built-in support for such structures as possible datatypes, we would likely be able
to remove this parsing step, offering a clearer rendering of those values in the grid, and probably other improvements in the user experience.
However, it is unclear how that would allow users to reconcile values inside those hierarchical data types, since the reconciliation data would have to be embedded deep down into the JSON or XML.
A functional version of the column groups feature
Perhaps there could be a way to improve the existing notion of column groups (in 3.x) and turn it into something really usable.