JSON: Parsing multiple arrays

mcyzyk · October 24, 2023, 2:30pm

I am learning a lot from Owen Stephens' example here:

But my data is a bit different. I have a chunk of JSON and within it is a single array consisting of multiple objects:

[
{"label":"Title","value":"Caarte Van West Indien, Soo Vaste landen als Eylanden, Nieuwelyckx Met Veele Perfexien Gecorigeert Ende Verbeetert, ende alle Clippen, Droochteen, Ree:en, Haavens, Rivieren, steeden, en Bergen et. daar in genoteert soo die ghelege zyn"},
{"label":"Title","value":"Heroes hodierni arconautæ in occidio of hedendaagscher helden-tocht met oorlogs vloten en heirlegers over zee getrocken. Virg: erunt etiam altea bella oceam finem juxta solemqsz cadentem tot Amsterdam, By Gerard Valck en Petrus Schenk"},
{"label":"Author/creator","value":"Vingboons, Johannes (1616/17-1670)"},
{"label":"Subject (topical)","value":"Unspecified"},
{"label":"Country","value":{"@value":"Netherlands","@language":"eng"}},
{"label":"Date issued","value":"2016 "},
{"label":"Publication date","value":"2016 "},
{"label":"Language","value":{"@value":"Dutch","@language":"eng"}},
{"label":"Language","value":{"@value":"Latin","@language":"eng"}},
{"label":"Extent","value":"1 digital resource"},
{"label":"Note","value":["On the bottom: ‘Joan Vinc : Boons fecit. Et exud.’","3 scale sticks."]},
{"label":"Subject (topical)","value":"74.26 geography of Central and South America"},
{"label":"Subject (geographic)","value":"Florida"},
{"label":"Subject (geographic)","value":"Caribbean region"},
{"label":"Subject (geographic)","value":"Gulf of Mexico"},
{"label":"Subject (geographic)","value":"Mexico"},
{"label":"Subject (geographic)","value":"Central America"},
{"label":"Published","value":"Amsterdam: Gerard Valck en Petrus Schenk, [circa 1700]"},
{"label":"Form","value":"copper engraving, in outline colour ; 51,1 x 71 cm"},
{"label":"Shelfmark","value":"COLLBN Port 189 N 1"},
]

So "label" "value" in each object.

I cannot guarantee the order of these objects.

I must loop over them and test each to see what it is. So, I will run one loop testing for Title and create a new OpenRefine column based on that. Another testing for Subjects, and so on.

Not sure how to do this.

More, this is an issue:

{"label":"Language","value":{"@value":"Latin","@language":"eng"}},

I need to get at the value "Latin" in this object, which appears to be an object within an object.

Anyway, I am searching for recipes in this regard.

I'm starting with Owen Stephens' example:

filter(forEach(value.parseJson()["categories"],v,v.toLowercase()),w,w=="chinese").uniques().join("")

So something like?

filter(forEach(value.parseJson()[NumberOfObjectsHere],v,v.toLowercase()),w,w=="If Label Equals 'Title'").uniques().join("")

How's that for pseudo-code! Advice appreciated.

OpenRefine Newby,

Mark

b2m · October 25, 2023, 7:00am

Hi Mark,

this is quite the detailed question! Thanks for taking the effort to provide your thoughts and examples and a reference where you started!

Before diving into GREL I have one remaining question: is this "junk of JSON" coming from an API call or so and you want to add it to an existing project. Or is this "junk of JSON" the start of your project?

I am asking because OpenRefine already offers to convert JSON data into tabular data when creating a project. And the strategy for this would be different then parsing JSON with GREL.

Cheers
Benjamin

mcyzyk · October 25, 2023, 12:55pm

Benjamin,

That chunk of JSON is coming from the metadata section of an IIIF Manifest, which was served from an Islandora content management system in the Libraries at the University of Leiden, Netherlands. I'm working on a project related to Caribbean Maps, 1450-1850, and am using OpenRefine to harvest/extract metadata on the Dutch maps this week. (Last week was French...)

With THEE WONDERFUL and POWERFUL OpenRefine I've been able to download HTML records, mine them for metadata, figure out URLs to IIIF Manifests, and download those, too. Now I'm seeing these valuable chunks of data in them which I intend to extract into separate fields in OpenRefine.

As I was falling asleep last night I thought: I really need to back up with this GREL expression and do first things first: That would be simply getting a forEach() loop working over this array of objects. Then, the more difficult part will be (I think) identifying if the current object in the loop is, for example, a Title object. If so, grab the value for that object and proceed with the loop.

Looking into this now, this morning (between meetings!).

Thanks, and best regards,

Mark

b2m · October 25, 2023, 2:58pm

Okay, so instead of filtering for certain keywords here is a more generic approach:

Write all the labels in a new column "labels".
Write all values in a new column "values".
Split the columns "labels" and "values" into rows.
Fix remaining JSON structures (e.g. nested values) in column "values".
Use Transpose or similar methods to produce column based data from your row based data.
Delete unnecessary columns.

For step 1 you could use the following GREL expression:

forEach(value.parseJson(), v, v["label"]).join("||")

For step 2 you could use the following GREL expression:

forEach(value.parseJson(), v, v["value"]).join("||")

Both GREL expressions are using the separator || that should be used in step 3 for splitting the multi-valued cells.

You then will still have some JSON structures in the column values.
So for step 4 you could use something like:

if(
  value.startsWith("{"),
  value.parseJson()["@value"],
  if(
    value.startsWith("["),
    value.parseJson().join(", "),
    value
  )
)

In the first if-clause the values from the nested objects are extracted.
In the second if-clause arrays are concatenated using , . Not sure what you would prefer, just adapt it =).

I hope this helps you finding a working strategy for your data.

mcyzyk · October 25, 2023, 3:46pm

So useful!

Many thanks,

Mark

tfmorris · October 26, 2023, 1:23am

It looks like the manifest for a PURL like https://digitalcollections.universiteitleiden.nl/view/item/53252 is available at https://digitalcollections.universiteitleiden.nl/iiif_manifest/item:53252/manifest

As Benjamin hinted at earlier, you might try importing the JSON from that URL to see if it gets you a better starting point. Selecting the first metadata entry as a template will give you most of the data imported in two-column, label & value, format.

The language tagging of some, but not all, values is a little awkward. Here's another expression snippet that might help with step #4:

forEach(value.parseJson().metadata, i, with(i.value["@value"],v,if(isNotNull(v), v, i.value)))

Tom

Topic		Replies	Views
Help with splitting a json array Support and Helpdesk	4	267	May 14, 2024
Structured Data in Cells (JSON Records and Arrays) Development & Design	0	55	December 28, 2024
Preserve record mode hierarchy for export / templating Support and Helpdesk	4	69	December 30, 2024
Could Someone Give me Advice with Complex Data Transformation in OpenRefine? Support and Helpdesk	3	74	July 26, 2024
Two ways of feature extraction: 1. split multi-value cells; 2. split into multiple columns Data cleaning and transformations	6	1246	March 24, 2023

JSON: Parsing multiple arrays

Related topics