Representing hierarchical data: beyond the records mode?

antonin_d · February 7, 2023, 8:14am

I am opening this thread to give a summary of an important open problem in OpenRefine’s design.
The goal is to encourage people to come up with alternative design proposals or pointers to solutions from other tools.
GitHub issues relevant to this are normally classified under the “records” tag. Here are a few places where such a discussion has happened before:

#2825
The future of the records mode
#2298, which proposes to investigate the Dremel encoding: “a novel columnar storage format for nested data, with algorithms for dissecting nested records into columns and reassembling them”

What is the records mode

OpenRefine projects are tables, or grids: they have a certain number of rows and columns, and a cell for each row/column combination.

This tabular representation works well in many cases. For instance, when working on a list of people: each person has a single date of birth, a single place of birth, a single height, so you can have each of those attributes in their own column, and have one row per person.

However, we often need to represent attributes which are not single-valued. For instance, a given person can have multiple hobbies or multiple email addresses. So it gets less convenient to represent them in a table.

OpenRefine’s records mode provides a workaround for such cases. The idea is to add extra rows so that each value can be stored in its own cell:

Person	Date of birth	Email address	Hobbies
Amanda	1978-07-03	amanda@brown.com	birdwatching
		amanda1787@hootmail.com	photography
			running
Peter	1986-11-23	peter.super@gimail.com	singing
		peter@southwest.us
		peter8976@hautmail.com

In this example, rows one to three form a record, and rows four to six another one. The separation between records depends on the first column: every row with a non-blank value in the first column defines a new record.
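The record boundary rule described above can be sketched in a few lines of Python. This is only an illustration, not OpenRefine's actual implementation: the names `is_blank` and `group_into_records` are hypothetical, blank is assumed to mean `None` or an empty string, and any rows that precede the first non-blank key are simply assumed to form a record of their own.

```python
from typing import List, Optional

Row = List[Optional[str]]


def is_blank(cell: Optional[str]) -> bool:
    # Assumption: a cell is blank when it is None or an empty string.
    return cell is None or cell == ""


def group_into_records(rows: List[Row]) -> List[List[Row]]:
    """Start a new record at every row whose first cell is non-blank."""
    records: List[List[Row]] = []
    for row in rows:
        if not records or not is_blank(row[0]):
            records.append([row])
        else:
            records[-1].append(row)
    return records


rows = [
    ["Amanda", "1978-07-03", "amanda@brown.com", "birdwatching"],
    [None, None, "amanda1787@hootmail.com", "photography"],
    [None, None, None, "running"],
    ["Peter", "1986-11-23", "peter.super@gimail.com", "singing"],
    [None, None, "peter@southwest.us", None],
    [None, None, "peter8976@hautmail.com", None],
]

records = group_into_records(rows)
for record in records:
    print(record[0][0], "spans", len(record), "rows")
```

With the example table, this yields two records of three rows each, one for Amanda and one for Peter.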

OpenRefine’s UI offers a switch between the rows and records mode. In rows mode, the records structure is ignored: all rows are treated independently of each other. In records mode however, most operations are applied for each record, making it for instance possible to manipulate all hobbies of a given person within the same GREL expression.

What are the problems with the records mode

Lack of support for hierarchical data of depth greater than one

Often, it is not enough to be able to have multi-valued attributes for the entities you are representing: those multiple values might have a structure of their own as well.
For example (following the user manual), your key column may be a film or television show, with multiple cast members identified by name, associated to that work. You may have one or more roles listed for each person. The roles are linked to the actors, which are linked to the title.

You can represent this in the spirit of the records mode, as follows:

Work	Actor	Role
The Wizard of Oz	Judy Garland	Dorothy Gale
	Ray Bolger	“Hunk”
		The Scarecrow
	Jack Haley	“Hickory”
		The Tin Man
	Bert Lahr	“Zeke”
		The Cowardly Lion
	Frank Morgan	Professor Marvel
		The Gatekeeper
		The Carriage Driver
		The Guard
		The Wizard of Oz
	Margaret Hamilton	Miss Almira Gulch
		The Wicked Witch of the West

The problem with this representation is that OpenRefine is not able to detect that each role is associated to a particular actor, and not to the containing work itself. The column groups feature was intended to add some column metadata that makes it possible to represent such a fact, but this feature has never been implemented fully and is not usable as such.
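To make the missing information concrete, here is a hedged sketch of what a reader needs to know to rebuild the nesting: that Role belongs under Actor, and Actor under Work. That ordering is hard-coded below as an assumption; it is exactly the piece of metadata the grid itself does not carry. The function name `nest_rows` and the output shape are hypothetical, and this is not how column groups are implemented.

```python
from typing import Dict, List, Optional, Tuple

Row = Tuple[Optional[str], Optional[str], Optional[str]]


def nest_rows(rows: List[Row]) -> List[Dict]:
    """Rebuild Work > Actor > Role nesting from (work, actor, role) rows.

    Assumes the first row names a work and every role follows an actor.
    """
    works: List[Dict] = []
    for work, actor, role in rows:
        if work:
            works.append({"work": work, "actors": []})
        if actor:
            works[-1]["actors"].append({"actor": actor, "roles": []})
        if role:
            # The role attaches to the most recent actor, not to the work.
            works[-1]["actors"][-1]["roles"].append(role)
    return works


rows = [
    ("The Wizard of Oz", "Judy Garland", "Dorothy Gale"),
    (None, "Frank Morgan", "Professor Marvel"),
    (None, None, "The Gatekeeper"),
    (None, "Margaret Hamilton", "Miss Almira Gulch"),
    (None, None, "The Wicked Witch of the West"),
]

works = nest_rows(rows)
for entry in works[0]["actors"]:
    print(entry["actor"], "->", entry["roles"])
```

Without the assumed column ordering, the same rows could just as well be read as five roles attached directly to the work.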

The JSON and XML importers use the structure above to coerce their hierarchical data formats into an OpenRefine table.
However, even with column groups, this conversion is lossy, meaning that we cannot export the data back into the original JSON or XML document (#1897).
This is a frequently requested feature.

Unclear differences between operations in rows and records mode and overall confusing user experience

When running an operation, it is not always clear what difference it makes whether we run it in rows or records mode.
Also, OpenRefine tends to switch automatically to the records mode (for instance when creating a project where the first column contains some blank values), and this can be confusing, even for users who know how the records mode works.

Overall, the records mode is far from intuitive and requires quite some talent for trainers to explain how it works.

Alternatives

Storing multiple values in a cell

You can combine those values in a single cell, by joining them with some separator.
OpenRefine offers operations to switch between the records representation and such a joined representation.
Working only in the joined representation is currently difficult. For instance, suppose you wanted to trim the whitespace around the hobbies in your dataset, and those hobbies were represented as joined values in single cells (such as knitting # ice-skating# playing board games if using # as a separator). You would have to use a more elaborate GREL expression than value.trim(), as this would only remove the leading whitespace before the first hobby and the trailing whitespace after the last hobby.
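As an illustration of the extra work involved, here is a small Python sketch of the split, trim and re-join steps that a per-value trim requires. The function name `trim_joined` is hypothetical, and the choice to re-join without padding around the separator is an assumption.

```python
def trim_joined(value: str, separator: str = "#") -> str:
    """Trim whitespace around every value stored in a joined cell."""
    parts = [part.strip() for part in value.split(separator)]
    return separator.join(parts)


cell = " knitting # ice-skating# playing board games "

# Trimming the whole cell only touches the two outer ends.
print(repr(cell.strip()))

# Trimming each value requires splitting on the separator first.
print(repr(trim_joined(cell)))
```

The same pattern would be needed for any other per-value transformation, which is what makes working only in the joined representation tedious.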

Similarly, if you want to reconcile each value, you cannot do so while they are stored in a single cell, since one cell can be reconciled to at most one entity at a time.
Another downside of this joined representation is that it requires the user to pick a separator which is guaranteed not to appear in cell values, otherwise they would be effectively splitting existing values into smaller parts.
Also, it does not really address the need to represent deeper hierarchical structures.

Better support for JSON/XML values in a cell

One could extend cells to be able to store not just basic datatypes (string, date, number, boolean…) but also hierarchical data, such as JSON or XML objects.

This is currently already possible, by keeping those hierarchical values as strings (and parsing them on the fly with parseJson() or parseXml() as required). By adding built-in support for such structures as possible datatypes, we would likely be able to remove this parsing step, offer a clearer rendering of those values in the grid, and probably make other improvements to the user experience.

However, it is unclear how that would allow users to reconcile values inside those hierarchical data types, since the reconciliation data would have to be embedded deep down into the JSON or XML.

A functional version of the column groups feature

Perhaps there could be a way to improve the existing notion of column groups (in 3.x) and turn it into something really usable.

thadguidry · February 7, 2023, 1:29pm

I think it’s likely that JSON will bring us into the long tail of a solution, especially given that the JSONPath libraries are well suited to the various tasks needed here: modifying, filtering, and evaluating based on an expression. A definite use case will be matching on an object in the record structure with a key matching the query, such as $[?(@.id==2)], and not only equaling a string, such as $[?(@.key=="value")], where either result could then be used as a value to reconcile.

It feels important that we let users reconcile those results without the extra burden of creating and storing the values in a new column first, which might be a very long-running operation that wastes their time when it is done only for discovery purposes, with a later recon pass. I can visualize a UI that would let a user select via a JSONPath expression and then, with that selection expression, run a recon to see if they get any hits. If they do, they would likely perform the real substructure extraction into additional columns as necessary or warranted.
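For readers unfamiliar with JSONPath filters, the two expressions above select the elements of an array whose id equals 2, or whose key equals "value". The sketch below mimics that selection with plain Python rather than a JSONPath library; the sample cell content and the `select` helper are made up for illustration.

```python
import json
from typing import Any, List

# Hypothetical JSON array stored as a string in a single cell.
cell_value = '[{"id": 1, "key": "other"}, {"id": 2, "key": "value"}]'


def select(items: List[Any], field: str, expected: Any) -> List[Any]:
    """Keep the objects whose `field` equals `expected`.

    Roughly what $[?(@.field==expected)] selects on a top-level array.
    """
    return [
        item for item in items
        if isinstance(item, dict) and field in item and item[field] == expected
    ]


items = json.loads(cell_value)
by_id = select(items, "id", 2)            # like $[?(@.id==2)]
by_key = select(items, "key", "value")    # like $[?(@.key=="value")]

print(by_id)
print(by_key)
```

Either result could then be handed to reconciliation directly, which is the step the post suggests should not require a new column first.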

So a visual expression selector would be nice for that. It might offer a preview or directly just do highlighting in the record view of cells, similar to how Regex matching highlighting works in online Regex tools.

I also don’t think we want to keep a “grid” view in our traditional sense for hierarchical data. We will likely need 2 or 3 view types as I hinted.

Also, YES, to Dremel encoding…it’s exactly what Greenplum database also uses for ultra-efficiency - Data in Hadoop section towards bottom.

The records redesign will definitely be a two-pronged approach:

1. a storage format for efficient querying of complex nested structures. Maybe Parquet-MR, sure, with RLE, which supersedes BITPACKED now.
2. visualization strategies.

Anyways, I want to stay out of the architecture business (1.) and only help with 2.

Michael_Markert · February 7, 2023, 3:40pm

Actually it is possible to preserve complex structures, but it needs a lot of effort as you instantly get dozens of columns you need to keep track of and address individually in a custom export script. And you have to work with placeholders for empty elements all the time. I for example construct something like

Erlichshausen;Konrad von;§;§;§|§;§;Conradus von Elrichshausen;§;§|§;§;Conradus;Fürst;2

from different columns as a cell value which might result in

"gndo:preferredNameOfThePerson": {
    "@type": "Person",
    "gndo:forename": "Konrad von",
    "gndo:surname": "Erlichshausen",
    "gndo:personalName": null,
    "gndo:nameAddition": null,
    "gndo:counting": null
},
"gndo:variantNameOfThePerson": [
    {
        "gndo:forename": null,
        "gndo:surname": null,
        "gndo:personalName": "Conradus von Elrichshausen",
        "gndo:nameAddition": null,
        "gndo:counting": null
    },
    {
        "gndo:forename": null,
        "gndo:surname": null,
        "gndo:personalName": "Conradus",
        "gndo:nameAddition": "Fürst",
        "gndo:counting": "2"
    }
]

during custom export as I split the value and process the elements differently with some hard to read GREL expressions like

"gndo:preferredNameOfThePerson": {
    "@type": "Person",
    {{with(cells["nameOfThePerson"].value.split("|")[0], v,
    "\n\"gndo\:forename\"\:" + if(v.split(";")[1] != "§", jsonize(v.split(";")[1]), "null") + "\," +
    "\n\"gndo\:surname\"\:" + if(v.split(";")[0] != "§", jsonize(v.split(";")[0]), "null") + "\," +
    "\n\"gndo\:personalName\"\:" + if(v.split(";")[2] != "§", jsonize(v.split(";")[2]), "null") + "\," +
    "\n\"gndo\:nameAddition\"\:" + if(v.split(";")[3] != "§", jsonize(v.split(";")[3]), "null") + "\," +
    "\n\"gndo\:counting\"\:" + if(v.split(";")[4] != "§", jsonize(v.split(";")[4]), "null")
    )}}
    },
"gndo:variantNameOfThePerson": [
    {{forEachIndex(cells["nameOfThePerson"].value.split("|"), i, v,
    if(cells["nameOfThePerson"].value.split("|").length()==1,'
    \{
    \"@type\"\: \"Person\",
    \"gndo:forename\"\: null,
    \"gndo:surname\"\: null,
    \"gndo:personalName\"\: null,
    \"gndo:nameAddition\"\: null,
    \"gndo:counting\"\: null
    \}
    ', '') +
    if(i !=0, with(cells["nameOfThePerson"].value.split("|")[i], v,
    "{" + "\n\"gndo\:forename\"\:" + if(v.split(";")[1] != "§", jsonize(v.split(";")[1]), "null") + "\," +
    "\n\"gndo\:surname\"\:" + if(v.split(";")[0] != "§", jsonize(v.split(";")[0]), "null") + "\," +
    "\n\"gndo\:personalName\"\:" + if(v.split(";")[2] != "§", jsonize(v.split(";")[2]), "null") + "\," +
    "\n\"gndo\:nameAddition\"\:" + if(v.split(";")[3] != "§", jsonize(v.split(";")[3]), "null") + "\," +
    "\n\"gndo\:counting\"\:" + if(v.split(";")[4] != "§", jsonize(v.split(";")[4]), "null") +
    "\n\}" +
    if(cells["nameOfThePerson"].value.split("|").length()!=i+1,'\,\n    ','')
    ), "")
    )}}
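For comparison, the same placeholder encoding can be decoded outside the templating exporter. The Python sketch below is only an illustration of the scheme as it can be read from the example: "|" separates names, ";" separates fields, "§" marks an empty field, the first group is the preferred name and the remaining groups are variants. The field positions are taken from the GREL template above, while the names `FIELD_POSITIONS`, `decode_name` and `decode_cell` are hypothetical. Unlike the template, it returns an empty list rather than an all-null entry when there are no variants.

```python
import json
from typing import Dict, Optional

PLACEHOLDER = "§"

# Position of each field inside one ";"-separated group.
FIELD_POSITIONS = {
    "gndo:forename": 1,
    "gndo:surname": 0,
    "gndo:personalName": 2,
    "gndo:nameAddition": 3,
    "gndo:counting": 4,
}


def decode_name(group: str) -> Dict[str, Optional[str]]:
    """Turn one group into a dict, mapping placeholders to None.

    Assumes every group carries all five fields.
    """
    parts = group.split(";")
    return {
        field: (None if parts[index] == PLACEHOLDER else parts[index])
        for field, index in FIELD_POSITIONS.items()
    }


def decode_cell(value: str) -> Dict[str, object]:
    groups = value.split("|")
    preferred = {"@type": "Person", **decode_name(groups[0])}
    variants = [decode_name(group) for group in groups[1:]]
    return {
        "gndo:preferredNameOfThePerson": preferred,
        "gndo:variantNameOfThePerson": variants,
    }


cell = ("Erlichshausen;Konrad von;§;§;§"
        "|§;§;Conradus von Elrichshausen;§;§"
        "|§;§;Conradus;Fürst;2")

result = decode_cell(cell)
print(json.dumps(result, ensure_ascii=False, indent=2))
```

Serializing through `json.dumps` sidesteps the manual escaping and comma handling that make the template hard to read.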

thadguidry · April 14, 2023, 5:45am

With this commit: Change detection of presence of records · OpenRefine/OpenRefine@64c552b (github.com)

I still don’t know whether this commit introduces the concept of key-based records. I see the tests still only use null or empty as the value in column 1. When will we get a records mode that can also be fully key-based? That is, the key is filled down in a column (no blanks or nulls), and rows whose values are identical in this key column or key pattern (currently limited to column 1 in OpenRefine, but it should be any column) are treated as part of the same record.

QUESTION 1: What’s the overall plan to support key-based column operations, in general, in the 4.0 timeline?

Future State Example:

key_or_ID	hasFeatureA	passedTests	GREL_index
1		test1	row.record.index = 0
1	true	test2	row.record.index = 0
1		test3	row.record.index = 0
2	true	test1	row.record.index = 1
2	false	test2	row.record.index = 1
2		test3	row.record.index = 1
3			row.record.index = 2
3	false	test2	row.record.index = 2
3		test3	row.record.index = 2

The key could be in column 1, or it could be in any column, or it could be defined by the user through a loose pattern, as suggested in Provide a way to Create Records and Groups by a Row pattern · Issue #2023 · OpenRefine/OpenRefine · GitHub
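A minimal sketch of the key-based grouping proposed here, in Python. It is not an OpenRefine feature: the function name `records_by_key` and the `key_column` parameter are hypothetical, and the sketch assumes that rows sharing a key are adjacent, so a key that reappears later would start a new record.

```python
from itertools import groupby
from typing import List, Optional

Row = List[Optional[str]]


def records_by_key(rows: List[Row], key_column: int = 0) -> List[List[Row]]:
    """Group consecutive rows that share the same value in key_column."""
    return [
        list(group)
        for _, group in groupby(rows, key=lambda row: row[key_column])
    ]


# The "Future State Example" table, without the GREL_index column.
rows = [
    ["1", None, "test1"],
    ["1", "true", "test2"],
    ["1", None, "test3"],
    ["2", "true", "test1"],
    ["2", "false", "test2"],
    ["2", None, "test3"],
    ["3", None, None],
    ["3", "false", "test2"],
    ["3", None, "test3"],
]

records = records_by_key(rows, key_column=0)

# Record index of every row, comparable to row.record.index in the table.
record_index = [i for i, record in enumerate(records) for _ in record]
print(record_index)
```

Passing a different `key_column` covers the case where the key is not in the first column; a user-defined row pattern would need a richer key function.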

QUESTION 2: Do we have an issue for the above? I looked through Issues · OpenRefine/OpenRefine (github.com) but didn’t see one that fit the key-based record grouping need.

antonin_d · April 14, 2023, 7:45am

Hi Thad,
No, the commit you linked to does not change the definition of what records are - just when the records mode is turned on automatically.

I don’t have any plans to implement the sort of grouping you are hinting at in 4.0 so far. My plan is just to do scaling and reproducibility improvements while leaving the definition of the records mode identical.

I think your idea is going in some direction that could be more intuitive for users, so it’s definitely worth thinking about, but I think on this issue we need a big design effort to ensure any successor of the existing records mode is workable in a lot of different use cases. And if it can also be made efficient, it’s also nice of course!

thadguidry · April 14, 2023, 7:48am

Shall we make an issue then? Or do you think it’s covered in the EPIC records redesign issue, or linked to in another issue?

antonin_d · April 14, 2023, 9:31am

I don’t know, perhaps I’d rather only create an issue once there is a clear proposal ready to be implemented, but that’s just a feeling.

thadguidry · December 8, 2023, 12:18pm

@wetneb I think that the idea of "nodes" or "containers" is closer to the original vision I pitched to David long ago (some very expensive ETL tools had this). The idea is that you can interactively see and move child nodes to new parents, and vice versa.

Imagine this being a single record row, where the boxes are colored (not a black-bordered grid like we have now) to present a much cleaner look that closely represents record data blocks, just like JSON hierarchies. Imagine that inside each block or grid cell is the substructure of the record data as text. So the blocks represent JSON objects, I think, in essence.

Nested grids demo (gridstackjs.com)
and
Advanced Nested grids demo
Click once on its Create button to add a 2nd record, then click and hold on any box node and move it into a new parent. Now move parents into new target parents. Alternatively, add new records with the "Add Widget" buttons.

You can imagine that substructures need to be cleaned up as a whole, and that is where further facets can help, for example: "show me only records that contain a key called 'address'". A user can then, with one mouse move, move all the 'address' nodes into an existing parent structure, or even type and create a new one. Besides a great UI for moving substructures of data around en masse, this also presents a nice way to do substructure or record "key" selection for further facets or transforms against records that have a particular selected "key" object (that demo doesn't do node selection highlighting when you click on a box, but it could).

This was the original idea of Flexible Data Presentation, where JSON would be used as our IR for Flexible Data Representation.

thadguidry · December 8, 2023, 1:25pm

An additional use case, which I described with example JSON data and record container relationships: Responsivity could be helping more with container parent/child rearrangement (dynamic) · Issue #70 · sneikki/tidgrid · GitHub

thadguidry · February 4, 2024, 12:39am

I really like Groovy's syntax of the spread operator, which calls an action on each item and then collects the results into a list. This seems like it could be very useful for performing actions on nested records, for example. I can imagine GREL eventually having a shortcut like that, instead of forEach(), to expand on operations against row.record (or wherever we eventually move the future record model).
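To illustrate the idea, here is a rough Python analogue of what Groovy's spread operator does: call the same method on every item and collect the results in a list, passing null items through unchanged. The helper name `spread` and the sample values are hypothetical, and nothing like this exists in GREL today.

```python
from operator import methodcaller
from typing import Any, Iterable, List


def spread(items: Iterable[Any], method_name: str, *args: Any) -> List[Any]:
    """Call method_name on each item and collect the results.

    None items are passed through as None, mirroring the null-safe
    behaviour of Groovy's spread operator.
    """
    call = methodcaller(method_name, *args)
    return [None if item is None else call(item) for item in items]


# Hypothetical values of one column within a single record.
hobbies = [" birdwatching", "photography ", None, " running "]

print(spread(hobbies, "strip"))
print(spread(["a-b", "c-d"], "split", "-"))
```

In Groovy the first call would be written along the lines of `hobbies*.trim()`, which is the kind of shortcut the post has in mind for nested records.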
