Parsing XML from WorldCat records

ostephens · August 2, 2024, 4:22pm

Hi and welcome @simms29

simms29:

I'm new to this. I have worldcat XML records in Open Refine and am trying to extract from said XML lines like this:
<dc:subject xsi:type="http://purl.org/dc/terms/LCSH">Perception--Physiological aspects.</dc:subject>
So that I can extract the subject words: Perception--Physiological aspects.

If I've understood correctly, you want to extract the values of the dc:subject element ONLY when the element has the attribute/value xsi:type="http://purl.org/dc/terms/LCSH". My answer below is based on this assumption. If I've misunderstood, please expand on what you are aiming for.

The GREL:

value.parseXml().select("dc|subject[xsi:type=http://purl.org/dc/terms/LCSH]")

should select all dc:subject elements with this attribute/value. It outputs an array (list) of XML elements, so to get the values from them you need the forEach function, which carries out an action on each item in an array.

So for example:

forEach(value.parseXml().select("dc|subject[xsi:type=http://purl.org/dc/terms/LCSH]"),x,x.ownText())

will get you an array (list) of the LCSH headings from the XML.

Finally, you cannot store an array of things in a cell in OpenRefine, so you'll need to convert this array into a text string to finish up. So your complete GREL transform could look like:

forEach(value.parseXml().select("dc|subject[xsi:type=http://purl.org/dc/terms/LCSH]"),x,x.ownText()).join("|")

This will give you a result which is a pipe-separated list of the LCSH headings from your XML.
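If it helps to see the three steps (select, forEach, join) spelled out outside OpenRefine, here is a minimal standalone Python sketch. It is not GREL and not what OpenRefine runs internally. The sample record, the namespace URIs bound to the dc and xsi prefixes, and the function name lcsh_subjects are all assumptions for illustration, so check them against your own records.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample record; real WorldCat XML may be structured differently.
SAMPLE = """<record xmlns:dc="http://purl.org/dc/elements/1.1/"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <dc:subject xsi:type="http://purl.org/dc/terms/LCSH">Perception--Physiological aspects.</dc:subject>
  <dc:subject>Senses and sensation</dc:subject>
  <dc:subject xsi:type="http://purl.org/dc/terms/LCSH">Neuropsychology.</dc:subject>
</record>"""

# Assumed namespace URIs for the dc: and xsi: prefixes.
DC = "http://purl.org/dc/elements/1.1/"
XSI = "http://www.w3.org/2001/XMLSchema-instance"
LCSH = "http://purl.org/dc/terms/LCSH"


def lcsh_subjects(xml_text):
    """Return the LCSH dc:subject values in xml_text as one pipe-separated string."""
    root = ET.fromstring(xml_text)

    # Step 1 (the select): keep dc:subject elements whose xsi:type is LCSH.
    selected = [
        el for el in root.iter("{%s}subject" % DC)
        if el.get("{%s}type" % XSI) == LCSH
    ]

    # Step 2 (the forEach): take each element's text.
    # Stripping whitespace is a choice made for this sketch.
    texts = [(el.text or "").strip() for el in selected]

    # Step 3 (the join): collapse the list into a single string for one cell.
    return "|".join(texts)


result = lcsh_subjects(SAMPLE)
print(result)
```

The dc:subject without an xsi:type attribute is skipped, and the two LCSH headings end up in one string separated by a pipe.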
