Parsing XML from WorldCat records

I'm new to this. I have WorldCat XML records in OpenRefine and am trying to extract data from XML lines like this:

<dc:subject xsi:type="http://purl.org/dc/terms/LCSH">Perception--Physiological aspects.</dc:subject>

So that I can extract the subject words: Perception--Physiological aspects.

There are also XML lines like this in the same record:

<dc:subject>molecular biology.</dc:subject>

and I have been successful in extracting just dc.subject using GREL.

But the data in dc.subject alone is shoddy and not available for most records, hence I want to get the subject headings from the XML lines as described above.

I've looked at the recipes on GitHub and have asked several others for help, but we're no closer to extracting the subject headings I want. I've tried parseXml and a host of other things described in the documentation.

Thoughts are most welcome. Kindly, Jen Simms

Hi and welcome @simms29

If I've understood correctly, you want to extract the values of the dc:subject element ONLY when the element has the attribute/value xsi:type="http://purl.org/dc/terms/LCSH". My answer below is based on this assumption. If I've misunderstood please expand on what you are aiming for.

The GREL:

value.parseXml().select("dc|subject[xsi:type=http://purl.org/dc/terms/LCSH]")

should select all elements of this type with this attribute/value. It outputs an array (list) of XML elements, so to get the values from them you need the forEach function, which carries out an action on each item in an array.

So for example:

forEach(value.parseXml().select("dc|subject[xsi:type=http://purl.org/dc/terms/LCSH]"),x,x.ownText())

will get you an array (list) of the LCSH values from the XML.

Finally, you cannot store an array of things in a cell in OpenRefine, so you'll need to convert this array into a text string to finish up. So your complete GREL transform could look like:

forEach(value.parseXml().select("dc|subject[xsi:type=http://purl.org/dc/terms/LCSH]"),x,x.ownText()).join("|")

This will give you a result which is a pipe-separated list of the LCSH values from your XML.
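
If you want to sanity-check the logic outside OpenRefine, here is a minimal Python sketch of the same three steps (select the matching elements, take their text, join with a pipe). It is not what OpenRefine runs internally. The wrapping `<record>` element, the namespace URIs bound to `dc` and `xsi`, and the function name `lcsh_subjects` are all assumptions made for illustration, so check the root element of your own records before relying on it.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URIs; check the root element of your own records.
DC = "http://purl.org/dc/elements/1.1/"
XSI = "http://www.w3.org/2001/XMLSchema-instance"
LCSH = "http://purl.org/dc/terms/LCSH"

# Hypothetical record built from the two lines quoted in the question.
SAMPLE = f"""<record xmlns:dc="{DC}" xmlns:xsi="{XSI}">
  <dc:subject xsi:type="{LCSH}">Perception--Physiological aspects.</dc:subject>
  <dc:subject>molecular biology.</dc:subject>
</record>"""


def lcsh_subjects(xml_text):
    """Return the text of each dc:subject whose xsi:type is the LCSH URI."""
    root = ET.fromstring(xml_text)
    return [
        (el.text or "").strip()
        for el in root.iter(f"{{{DC}}}subject")   # every dc:subject element
        if el.get(f"{{{XSI}}}type") == LCSH       # keep only the LCSH-typed ones
    ]


# Join the list into one pipe-separated string, like .join("|") in GREL.
print("|".join(lcsh_subjects(SAMPLE)))
```

With the sample above, the untyped "molecular biology." subject is skipped and only the LCSH heading is printed.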

I, too, am new to parsing XML in OpenRefine. I was able to use the above answer to help me solve one part of a problem I am having, so thank you!

I am working with MARCXML from the OCLC Metadata API and the data is returned like this:

   </datafield>
    <datafield tag="650" ind1=" " ind2="0">
      <subfield code="a">Agricultural exhibitions</subfield>
      <subfield code="z">Greece.</subfield>
    </datafield>
    <datafield tag="611" ind1="2" ind2="0">
      <subfield code="a">Panama-Pacific International Exposition</subfield>
      <subfield code="d">(1915 :</subfield>
      <subfield code="c">San Francisco, Calif.)</subfield>
    </datafield>

I would like a list of just the $a data from all 60x - 651 tags where the second indicator is 0. I was able to generate the following with the help of the above answer, but am not sure how to include a requirement that ind2 = 0.

forEach(value.parseXml().select("datafield[tag~=(600|610|611|630|650|651)] subfield[code=a]"),x,x.ownText()).join("|")

Everything I have tried to extend the above so that it only returns the $a when the second indicator is 0 has failed. Thank you for any thoughts or guidance on how to work with MARCXML indicators!

Hi @christie just a side comment, many of the WorldCat Metadata API endpoints can return JSON instead of XML, as an example /worldcat/search/brief-bibs/{oclcNumber}, which might make things "slightly" easier. Are you retrieving your local bibs? Which endpoint URL are you using to make requests?

Looks like Jsoup parses the attributes one after the other, so this seems to work, where I simply put the next attribute [ind2=0] after the [tag~=] attribute:

forEach(value.parseXml().select("datafield[tag~=(600|610|611|630|650|651)][ind2=0] subfield[code=a]"),x,x.ownText()).join("|")

Example adding a 3rd [ind1=2] or [ind1= ] attribute for <datafield> tags to match against:

forEach(value.parseXml().select("datafield[tag~=(600|610|611|630|650|651)][ind2=0][ind1=2] subfield[code=a]"),x,x.ownText()).join("|")

forEach(value.parseXml().select("datafield[tag~=(600|610|611|630|650|651)][ind2=0][ind1= ] subfield[code=a]"),x,x.ownText()).join("|")

This was the reference I used: the Selector page of the jsoup HTML parser documentation.
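
For anyone who wants to cross-check the results outside OpenRefine, below is a rough Python sketch of the same filter: keep datafields whose tag is one of the six subject tags and whose ind2 is 0, optionally also matching ind1, then collect the $a subfields. The `<record>` wrapper, the function name `subject_a`, and the absence of an XML namespace are assumptions; real MARCXML often declares a default namespace, in which case the plain tag names here would not match. It also compares the tag exactly against a set, whereas `[tag~=...]` in the selector is a regular-expression match.

```python
import xml.etree.ElementTree as ET

# The six tags used in the GREL selector above.
SUBJECT_TAGS = {"600", "610", "611", "630", "650", "651"}

# The two datafields from the post, wrapped in a hypothetical <record> root.
SAMPLE = """<record>
  <datafield tag="650" ind1=" " ind2="0">
    <subfield code="a">Agricultural exhibitions</subfield>
    <subfield code="z">Greece.</subfield>
  </datafield>
  <datafield tag="611" ind1="2" ind2="0">
    <subfield code="a">Panama-Pacific International Exposition</subfield>
    <subfield code="d">(1915 :</subfield>
    <subfield code="c">San Francisco, Calif.)</subfield>
  </datafield>
</record>"""


def subject_a(xml_text, ind2="0", ind1=None):
    """Collect $a from subject datafields matching the given indicators."""
    root = ET.fromstring(xml_text)
    values = []
    for field in root.iter("datafield"):
        if field.get("tag") not in SUBJECT_TAGS:
            continue  # not one of the subject tags
        if field.get("ind2") != ind2:
            continue  # second indicator does not match
        if ind1 is not None and field.get("ind1") != ind1:
            continue  # optional first-indicator filter
        for sub in field.findall("subfield"):
            if sub.get("code") == "a":
                values.append((sub.text or "").strip())
    return values


print("|".join(subject_a(SAMPLE)))            # ind2 = 0 only
print("|".join(subject_a(SAMPLE, ind1="2")))  # ind2 = 0 and ind1 = 2
print("|".join(subject_a(SAMPLE, ind1=" ")))  # ind2 = 0 and blank ind1
```

On this sample the first call returns both $a values, the second only the 611 heading, and the third only the 650 heading.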

Thank you. I had the spacing wrong in all of my attempts at getting to this syntax. I was able to get the following to work.

forEach(value.parseXml().select("datafield[tag~=(600|610|611|630|650|651)][ind2=0] subfield[code=a]"),x,x.ownText()).join("|")

To answer your other questions, I am using /worldcat/manage/bibs/ to get the MARC record. This is a retrospective cataloging project to update older cataloging records, and we are interested in including cataloging information (level of cataloging, for instance), which I don't think is possible with the search API. I also think that the only source for subject information and physical description in the Metadata API is the full MARC record. This work had been done before using the WorldCat Search API v.1.0, without the ability to limit based on cataloging information. I am still figuring out what is possible with the various OCLC APIs.

Thank you again for your help. I really appreciate it.

Christie

This is phenomenal! I cannot thank you enough. WOW. This helps so much.
