Issue augmenting Wikidata QIDs in OR

Hi all - so nice to find the OR Discourse :slightly_smiling_face:, & helpful to look through some of the existing issues on this topic - I have read around the issue I’m having & initially wondered whether it could be a proxy issue as described here.

I am having issues using the reconciliation feature to pull down some extra data on a column of Wikidata QIDs. The eventual workflow will be to pull down each QID’s equivalent VIAF ID, and then to run another round of reconciliation to pull properly formed corporate names from VIAF. When I try reconciling the column of QIDs, either by ReconcileUse values as identifiers, or by Edit columnAdd columns from reconciled values, I only get the opportunity to match my values to Qid values, with no other Wikidata metadata listed:

The reconciliation of my QIDs works fine but I’m struggling to pull down the extra fields I’m looking for; any help much appreciated, I’m not convinced it’s a proxy issue (as I can successfully reconcile QIDs into their matching values), but can’t augment or pull up a list of associated values like VIAF IDs. Thanks for your help & hoping everyone well here,

MSH

Hi and welcome @MSH

Usually at the screen that you are displaying in the screenshot I’d expect to type in a property name in the “Add Property” box and for that to show available properties - what happens if you type into that box.

I’ve included some screenshots from an example


If this isn’t working please give some more details about the project (if possible sharing the whole project here is great, but if not some example data and screenshots of the actual project would be helpful) and which version of OpenRefine you are using - the more information you can share the easier it is for people to help.

Thanks and best wishes

Owen

1 Like

Just to be clear - the process for adding data from Wikidata to your project would be:

  1. reconcile an existing column in your project to Wikidata entities
  2. use “Add columns from reconciled values” to pull additional data down from Wikidata entities

You have to have first reconciled something in your existing project to do the second step.

You mention the ReconcileUse values as identifiers function and I just want to clarify that this is essentially a shortcut where you already have the IDs from a service that supports reconciliation (e.g. a list of Wikidata IDs) and you just want to skip the full reconciliation process and assert that the identifiers you have should be treated as already reconciled (this is a very useful shortcut but it does skip any quality assurance on the process - even if the IDs don’t exist on the target service they will be marked as reconciled with this approach).

2 Likes

Hey Owen - thank you so much for this !, especially for the prompt reply while we are all getting back up to speed - hope the New Year OK so far. In response to the first reply: many thanks, this is exactly the issue I was having, & as soon as I put in VIAF as a prompt I could map the IDs; apologies, I’d seen a few screenshots that made me think I’d get an index of columns straight away but this was an easy fix.

Re: the second post, I’m making sure to reconcile in advance but find the Use values as identifiers shortcut very helpful, particularly given the process of reconciling against no particular type given there is some variety in the QIDs. As it stands I’ve followed the correct sequence in your post & successfully generated VIAF IDs where they exist, and have a new column to reconcile & augment with the names I’m looking for. At the moment, when trying to pull down data from these reconciled values I run into the “This reconciliation service does not support data extension…” error; I have updated the VIAF service URL to https://refine.codefork.com/reconcile/viaf, per codefork, & have also tried using the LC proxy service with similar results:

At a glance these reconciled values don’t seem as firm as the Wikidata QIDs - I still have some options to confirm the matches - but nonetheless it doesn’t seem like I can continue to query the VIAF server. Happy to upload a sample of the dataset if easier to look this over, & FWIW all of this is happening in 3.6.2 having just updated. Thank you again for helping me over the reconciliation hurdles, & no worries if no obvious path forward beyond pulling down the VIAF #s.

Hi @MSH

Yes - unfortunately I’m not aware of any VIAF reconciliation service that supports the “data extension” part of the reconciliation API - which is the bit that OpenRefine uses to do the “Add columns from reconciled values” step.

While it doesn’t cover every reconciliation service running in the world, the ‘testbench’ at Reconciliation service test bench will show you which parts of the API is supported by various services - the “Extend data” (last column) is the one required for the “Add columns from reconciled values” function.

However - all is not lost :slight_smile: You can still add data from VIAF to your project, it just takes a couple more steps and a little more effort. The approach you can take is “Add column by fetching URLs” to fetch data from VIAF (there are various different formats supported by VIAF for this), and then extract in the information you need from the response you get (which will depend on which URL/format you request from VIAF).

As an example - you can use the following steps:

  1. From a column containing a VIAF ID, use “Add column by fetching URLS”
  2. In the expression box use GREL "https://viaf.org/viaf/"+value+"/justlinks.json"
  3. This should populate a new column with some coded content using the JSON format - which will look a bit like:
    "BIBSYS":["90251896"],
    "BNF":["http://catalogue.bnf.fr/ark:/12148/cb12265567w"],
    "CAOONL":["ncf11115758"],
    "DNB":["http://d-nb.info/gnd/6128261-3"],
    "ISNI":["0000000404235368"],
    "J9U":["987007257563005171"],
    "LC":["n79009603"],
    "NII":["DA06530721"],
    "NKC":["xx0131303"],
    "NLA":["000035721919"],
    "Identities":["https://www.worldcat.org/identities/lccn-n79009603"]}
  1. On your new column which contains the JSON, do “Add column based on this column”
  2. In the expression box use the GREL value.parseJson().LC.join("|")

The outcome would be a new column with the LC NAF ID (in this case) in it

Depending on which information you want from VIAF you may need to request a different URL (for example using the suffix /marc21.xml instead of /justlinks.json will get you the full VIAF record in MARCXML) and then the extraction of the information from the data would be different (I’ve used probably the simplest example above).

Please feel free to post more about exactly which information you need from VIAF and I can provide some guidance on which format might be best and how to extract the right data from the response (which is not always straightforward)

Owen

2 Likes

Hi again @ostephens - thank you for this!, both flagging the issue with the VIAF service & walking me through the workaround; much appreciated. I went back to an old OR training video recently & was delighted to discover the JSON parsing you mentioned - had never played with this but am quite often looking for a single value in an otherwise busy JSON object, having this quick GREL extract code is incredibly helpful.

Have successfully managed to pull out the LC code from VIAF following the instructions above - interestingly (/annoyingly, please feel free to leave me puzzling with this one as the complications pile up) this prompts another step of reconciliation as I’m eventually looking to return a properly formed name, with the chain of transformations looking like:

Wikidata QID → VIAF ID → LC nID → corresponding LC name

In this example I’d query the nID n81032917 and return ‎‡a Fields, Gracie,‏ ‎‡d 1898-1979; there isn’t an obvious service for this on the test bench list but I’m sure a variety of them can accomplish it, & the Library of Congress has had various Linked Data APIs operating for a while that I can try to work with. Great to see a few of the reconciliation services on the library & archives side (like SNAC) building OpenRefine into their plans & will try to chronicle any progress here if no further developments :slight_smile:

Ooh, this is a really helpful tidbit! Thanks!

2 Likes

In this example I’d query the nID n81032917 and return ‎‡a Fields, Gracie,‏ ‎‡d 1898-1979; there isn’t an obvious service for this on the test bench list but I’m sure a variety of them can accomplish it, & the Library of Congress has had various Linked Data APIs operating for a while that I can try to work with.

By changing the extension on the base URL (https://id.loc.gov/authorities/names/n81032917), you can get a variety of different formats, including JSON, NT, etc, You can also use content negotiation headers, but changing the URL is probably easier. Because it's MADS/SKOS RDF, it's kind of a mess to parse as JSON, but the info is there.

Actually, on second thought, the MARC XML format might be the easiest to handle - https://id.loc.gov/authorities/names/n81032917.marcxml.xml

Tom

2 Likes

The LC NAF (and other LoC linked data services) also has an “interesting” lightweight way to get the preferred label using just the HEAD method. Unfortunately, I don’t think it can be used in OpenRefine, but folks using Python or other languages/tools might be able to take advantage.

curl -Is [https://id.loc.gov/authorities/names/n81032917](https://id.loc.gov/authorities/names/n81032917) | grep x-preflabel
x-preflabel: Fields, Gracie, 1898-1979
x-preflabel-escaped: Fields, Gracie, 1898-1979
x-preflabel-encoded: Fields%2C%20Gracie%2C%201898-1979

The MARC XML format isn’t actually a delight to work with either, so here’s a cheat code to get the 100a from the payload returned by a .marcxml.xml URL:

value.parseXml().select('marcxml|datafield[tag=100] > marcxml|subfield[code=a]')[0].xmlText()

getting the d subfield is left as an exercise for the reader.:wink: Because the data is pretty constrained and unlikely to change format, you can also “cheat” a little to simplify things like:

value.parseXml().select('[tag=100] > [code=a]')[0].xmlText()

or get fancy and return all the 100 subfields joined by pipe (|) characters:

forEach(value.parseXml().select('[tag=100] > [code]'), subfield, subfield.xmlText()).join('|')

Tom

2 Likes

Ugh! The mail forwarder apparently can’t handle something as simple as monospaced code snippets in an HTML email. Sorry about the mess!

1 Like

the chain of transformations looking like:
Wikidata QID → VIAF ID → LC nID → corresponding LC name
In this example I’d query the nID n81032917 and return ‎‡a Fields, Gracie,‏ ‎‡d 1898-1979

As @tfmorris suggests it would be possible to do this via the id.loc.gov service, but possibly you can do it directly from VIAF as well which might save you a step.

So rather than using "https://viaf.org/viaf/"+value+"/justlinks.json" as I suggest above, instead you could use
"https://viaf.org/viaf/"+value+"/marc21.xml"

Using the Age UK example (VIAF ID = 158063944) this will return some XML starting like:

<mx:record xmlns:v="http://viaf.org/viaf/terms#" xmlns:srw="http://www.loc.gov/zing/srw/"
           xmlns:mx="http://www.loc.gov/MARC21/slim">
   <mx:leader>00000nz##a22.....n..4500</mx:leader>
   <mx:controlfield tag="001">viaf158063944</mx:controlfield>
   <mx:controlfield tag="003">OCoLC</mx:controlfield>
   <mx:controlfield tag="008">230102n||azznnabbn##########|a#aaa####|#</mx:controlfield>
   <mx:datafield tag="024" ind1="7" ind2=" ">
      <mx:subfield code="a">http://viaf.org/viaf/158063944</mx:subfield>
      <mx:subfield code="2">uri</mx:subfield>
   </mx:datafield>
   <mx:datafield tag="040" ind1=" " ind2=" ">
      <mx:subfield code="a">VIAF</mx:subfield>
      <mx:subfield code="c">VIAF</mx:subfield>
   </mx:datafield>
   <mx:datafield ind1="2" ind2=" " tag="710">
      <mx:subfield code="a">Age Concern England</mx:subfield>
      <mx:subfield code="0">(ISNI)0000000404235368</mx:subfield>
   </mx:datafield>

There will be repeated 710 fields (mx:datafield with tag = 710) in this XML and you can use the GREL parseXml() to pull out the LC one. You probably already know, but the 710 field is specific for corporate authors/organisations, and if you were to do the same thing for a person (e.g. Gracie Fields) then you'd be looking for a 700 field instead.

There are going to be a variety of ways to extract the data you want from the XML, but a couple of examples might be:

filter(value.parseXml().select("mx|datafield[tag=710]"),x,x.select("mx|subfield[code=0]")[0].ownText().startsWith("(LC"))).join("")

and

forEach(value.parseXml().select("mx|subfield[code=0]"),x,if(x.ownText().startsWith("(LC)"),x.parent(),null)).join("")

The first of these specifically looks for the 710 fields and then filters down to those with "(LC)" in the subfield zero, whereas the second will extract any field with a subfield zero that starts with "(LC)" no matter what the MARC field. There are many variations on this theme for selecting the appropriate fields and @tfmorris's examples contain some alternative select() statements you could also use as the basis of the xml parsing.

Once you have it down to the relevant fields you can extract the various subfields using parseXml() select() and ownText() appropriately. For example, from the previous step if we get the output:

<mx:datafield ind1="2" ind2=" " tag="710">
      <mx:subfield code="a">Age Concern England (Organization)</mx:subfield>
      <mx:subfield code="0">(LC)n  79009603</mx:subfield>
   </mx:datafield>	

then we can extract the subfield a by:

value.parseXml().select("mx|subfield[code=a]")[0].ownText()

Parsing the XML requires some slightly more involved processing than the justlinks.json but hopefully between the examples given by @tfmorris and the ones I've offered here there is enough to show how it can be done. Please do ask if there's anything more I can clarify.

Owen

2 Likes

Thank you @tfmorris!, this is wonderful - what a great tour through all the different outputs; both the XML & JSON are a little unwieldy but contain all the data I needed, & the regular expressions meant I could somewhat clumsily try to pare down to just the name values. Quick shout to the old OR Google Group archives where I found a few others working with similar data. I made it as far as extracting some of the fields from the LC data while doing some trial & error around the correct GREL syntax for fields like datafield tag=“100”.

After this, I’ve gone back to Owen’s great final steps, & can successfully return the 710/700 values I want using the second of the two extract snippets above, all from the original VIAF. Depending on what these look like when I tackle a series of personal rather than corporate names I might go back & join a few of the split subfields, but generally things are looking great.

Thank you both for the extremely patient help on this! We’re cataloguing some websites over here & the lack of controlled data on authors or issuing organisations can be a big limitation, so being able to work this up from fairly easily assigned Wikidata IDs is a great step :slight_smile:. Thanks for thinking this through from start to end; I’m writing up the workflow we’ve used so we can see where else it might be useful - will post any further updates here. TY again & TY for continued work making OR so useful.

1 Like

| MSH
January 6 |

  • | - |

Quick shout to the old OR Google Group archives where I found a few others working with similar data.

Glad you found my decade old VIAF answer useful. :grin: I hope we don't lose that 12+ year trove of information!

Tom

1 Like