Parsing XML from WorldCat records

Hi and welcome @simms29

If I've understood correctly, you want to extract the values of the dc:subject element ONLY when the element has the attribute/value xsi:type="http://purl.org/dc/terms/LCSH". My answer below is based on this assumption. If I've misunderstood, please expand on what you are aiming for.

The GREL:

value.parseXml().select("dc|subject[xsi:type=http://purl.org/dc/terms/LCSH]")

should select all dc:subject elements with this attribute/value. It outputs an array (list) of XML elements, so to get the values out of them you need the forEach function, which carries out some action on each item in an array.

So for example:

forEach(value.parseXml().select("dc|subject[xsi:type=http://purl.org/dc/terms/LCSH]"),x,x.ownText())

will get you an array (list) of the LCSH headings from the XML.

Finally, you cannot store an array of things in a cell in OpenRefine, so you'll need to convert this array into a text string to finish up. So your complete GREL transform could look like:

forEach(value.parseXml().select("dc|subject[xsi:type=http://purl.org/dc/terms/LCSH]"),x,x.ownText()).join("|")

This will give you a pipe-separated list of the LCSH headings from your XML.
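If you want to sanity-check the logic outside OpenRefine, here is a rough stand-alone Python sketch of the same three steps (select the matching elements, take their text, join with a pipe). The sample record, the root element name, and the function name lcsh_subjects are made up for illustration; your real WorldCat XML will look different, and this uses Python's standard library rather than anything in OpenRefine.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample record, invented for this example only.
SAMPLE = """<record xmlns:dc="http://purl.org/dc/elements/1.1/"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <dc:subject xsi:type="http://purl.org/dc/terms/LCSH">Whales</dc:subject>
  <dc:subject>Marine mammals</dc:subject>
  <dc:subject xsi:type="http://purl.org/dc/terms/LCSH">Whaling--History</dc:subject>
</record>"""

DC = "http://purl.org/dc/elements/1.1/"
XSI = "http://www.w3.org/2001/XMLSchema-instance"
LCSH = "http://purl.org/dc/terms/LCSH"


def lcsh_subjects(xml_text, sep="|"):
    """Return the text of every dc:subject typed as LCSH, joined by sep."""
    root = ET.fromstring(xml_text)
    values = []
    # Step 1: walk every dc:subject element in the record.
    for el in root.iter("{%s}subject" % DC):
        # Keep only those whose xsi:type attribute is the LCSH URI.
        if el.get("{%s}type" % XSI) == LCSH:
            # Step 2: take the element's own text.
            values.append((el.text or "").strip())
    # Step 3: collapse the list into a single string for one cell.
    return sep.join(values)


print(lcsh_subjects(SAMPLE))  # prints Whales|Whaling--History
```

The untyped "Marine mammals" subject is skipped, which is the same filtering the attribute selector does in the GREL above.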
