Meeting with the DNB community

I had a quick call with people (including @Michael_Markert) who contribute to the GND (Gemeinsame Normdatei), the authority file of the German National Library (DNB). They are interested in exploring ways in which OpenRefine could be equipped with an integration similar to the Wikibase one, but for populating the GND instead.

OpenRefine is already used in such workflows, in combination with the lobid-GND reconciliation endpoint, to do the initial matching. To export the data to GND, one approach is to generate MARC data from OpenRefine using the templating exporter (I assume in XML format, but I am not sure). One problem with this is that it does not come with any quality checks, or even validation that the resulting file is indeed valid MARC. So a better way to export to MARC would potentially be useful. We could take inspiration from the Wikibase schema editor (to specify which MARC fields the data should be mapped to) and its issues tab (to validate the translated data).
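
To make the validation gap concrete, here is a minimal sketch of the kind of post-export check that is currently missing, assuming the templating exporter produced a MARCXML file and that pymarc is available. The file name and the individual checks are placeholders for illustration, not actual GND requirements:

```python
from pymarc import parse_xml_to_array

# Parsing should fail outright if the file is not well-formed MARCXML.
records = parse_xml_to_array("export.xml")  # placeholder file name

for i, record in enumerate(records):
    problems = []
    # The leader of a MARC record is always 24 characters long.
    if len(str(record.leader)) != 24:
        problems.append("leader is not 24 characters long")
    # Illustrative rule: expect at least one 1XX heading field.
    if not any(f.tag.startswith("1") for f in record.get_fields()):
        problems.append("no 1XX heading field")
    if problems:
        print(f"record {i}: " + "; ".join(problems))
```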

Other approaches that were mentioned include:

  • adding support for data upload in the reconciliation API itself
  • exporting from OpenRefine to other formats, such as RDF

I mentioned that we are interested in supporting use cases like this one and that we are thinking about how to improve our extension system to make it easier to develop such integrations outside of OpenRefine's code base.

There will probably be a follow-up meeting. The conversation was in German, but I am sure we could switch to English if needed. Is anyone interested in discussing this? I mentioned that we have the BarCamp coming up, where such discussions could also take place.

See also this thread on metadaten.community: GND-Updates aus OpenRefine - Gemeinsame Normdatei (GND)


Great summary, @antonin_d!

Was there any discussion of using MarcEdit's MARC Breaker format as an intermediary format, or its MARCValidator for the validation? MarcEdit is a well-established and respected tool which is 100% focused on MARC.

MARC is a dead/dying format (being replaced by BIBFRAME/FRBR/LRM/...), so investing in new development for it may not provide the best return on investment. Aiming for one of the future RDF-based library formats may provide for more longevity.

More generally, but also in the (much) longer term, it may be worth considering whether there is a need for some type of standardized synchronization / update mechanism. We've got matching and fetching protocols, which cover 2/3 of the space. Google Refine had a mechanism to submit triples/quads to the "Refinery" for ingestion into Freebase, and we've got a Wikidata-specific update mechanism, but perhaps it's worth considering what a general solution would look like.

I've helped OpenLibrary a fair bit with their MARC handling and one of the things I've learned is that there is a lot of idiosyncratic MARC usage in the field. The MARCXML vs MARC 21 question is an important detail because the original MARC 21 is a baroque binary format which is much more difficult to deal with (and can use ancient MARC-8 character encoding). The MARC Breaker format is even simpler to deal with than MARCXML.
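
For what it's worth, here is a small sketch of going from binary MARC 21 (ISO 2709) to MARCXML with pymarc, mostly to illustrate how much of the binary format's awkwardness a library has to absorb. The file names are placeholders, and how well MARC-8 encoded input is handled depends on the encoding byte the records declare in their leader:

```python
from pymarc import MARCReader, XMLWriter

with open("records.mrc", "rb") as marc_in, open("records.xml", "wb") as xml_out:
    writer = XMLWriter(xml_out)
    for record in MARCReader(marc_in):
        if record is None:
            # pymarc yields None for records it could not parse
            continue
        writer.write(record)
    writer.close()  # closes the MARCXML <collection> element
```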

I don't speak (or read) German, but am happy to offer advice where I can, particularly if they have a description of the requirements that they're trying to satisfy.

Tom

Was there any discussion of using MarcEdit's MARC Breaker format as an intermediary format, or its MARCValidator for the validation? MarcEdit is a well-established and respected tool which is 100% focused on MARC.

I don't think it was, but to be honest I'm so unfamiliar with the MARC ecosystem that I may well have missed it.

Aiming for one of the future RDF-based library formats may provide for more longevity.

That's intuitively the very purpose of RDF. OpenRefine could decide to just invest in a great RDF exporter (the RDF Transform extension likely being the best candidate), through which multiple communities could export to their own platforms: exporting to GND via RDF, to Wikidata also via RDF, to GraphDB via RDF of course…
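
To make that concrete, here is a rough sketch (using rdflib) of the kind of statement such a generic RDF exporter could produce for GND. The GND ontology namespace and the example GND URI are real, but whether a GND ingest would accept exactly these properties is an assumption on my part:

```python
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF

# GND ontology (GNDO) namespace published by the DNB
GNDO = Namespace("https://d-nb.info/standards/elementset/gnd#")

g = Graph()
g.bind("gndo", GNDO)

# Example GND URI (Johann Wolfgang von Goethe)
person = URIRef("https://d-nb.info/gnd/118540238")
g.add((person, RDF.type, GNDO.DifferentiatedPerson))
g.add((person, GNDO.preferredNameForThePerson,
       Literal("Goethe, Johann Wolfgang von")))

print(g.serialize(format="turtle"))
```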

In the case of Wikidata / Wikibase, the reason why I didn't go down that route is that I felt it would be much less user friendly, because it would require from users more knowledge of how the Wikibase data model is translated to RDF. ("You want to import dates in Wikidata? Then go and learn how Wikibase dates are serialized in RDF and come back!") Also, there is no way to "make edits to Wikidata by importing RDF", because the translation only goes in the other direction. (So far at least! I remember an attempt to build something like this for a new version of the primary sources tool in Wikidata, which didn't see the light of day in the end).

I suspect that if people currently use the templating exporter to generate MARC instead of RDF, it's perhaps for similar reasons: they want to generate something that's closer to the actual data model of the database they want to contribute to, not just a potentially lossy or convoluted view of it. But that's me speculating.

perhaps it's worth considering what a general solution would look like.

If we could build something generic, for sure the potential would be pretty big! Developing and maintaining custom integrations is definitely costly.

In my opinion it's quite challenging, because when writing data into a database, it feels crucial to me that the data be expressed in a format that is as faithful as possible to the database's native data format.

The reconciliation API has reached this sort of general usefulness (applicability to many databases) and it is lossy: it coerces the database's data model into a fairly basic one, with entities, properties and types. But I think this loss of expressiveness is sort of acceptable for matching (although it's not ideal: for instance, people want to fetch Wikidata qualifiers via the recon service, and that's not supported). For editing, I think that would be a lot more restrictive. That's why I am not convinced by the idea of adding write support to the reconciliation API. That's also the core reason why I think building a generic solution to this data import problem is hard, even outside of the recon API.
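
To illustrate the coercion I mean, here is a sketch of a reconciliation query as OpenRefine would send it to the lobid-GND endpoint: everything has to be flattened into a query string, a type and plain property/value pairs, with no room for qualifiers or more structured values. The endpoint URL and the property/type identifiers are assumptions on my part; the real ones are advertised in the service manifest:

```python
import json
import requests

# Assumed lobid-GND reconciliation endpoint
ENDPOINT = "https://lobid.org/gnd/reconcile"

queries = {
    "q0": {
        "query": "Johann Wolfgang von Goethe",
        "type": "DifferentiatedPerson",  # hypothetical type id
        "properties": [
            # structured data gets flattened to pid/value pairs
            {"pid": "dateOfBirth", "v": "1749-08-28"},
        ],
    }
}

# Batch queries are sent as a form-encoded "queries" parameter
response = requests.post(ENDPOINT, data={"queries": json.dumps(queries)})
for candidate in response.json()["q0"]["result"]:
    print(candidate["id"], candidate["name"], candidate["score"], candidate["match"])
```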

I believe it would be useful to have a stand-alone feature for validating data against arbitrary rules, whether they are integrity checks or target-specific requirements. We could save a set of rules as a template, making it easy to apply them when preparing data for import into a specific system.
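
Purely to make the idea concrete, here is a very rough sketch of what such reusable rule sets could look like. Nothing here corresponds to an existing OpenRefine feature; the column names and rule shapes are invented for illustration:

```python
import re

# A "template": a named set of rules, each mapping a column to a check.
# Both the columns and the checks are hypothetical.
gnd_authority_rules = {
    "preferred name": lambda value: bool(value and value.strip()),
    "gnd id": lambda value: re.fullmatch(r"[0-9X-]+", value or "") is not None,
}

def validate(rows, rules):
    """Return (row index, column, value) for every failed check."""
    issues = []
    for i, row in enumerate(rows):
        for column, check in rules.items():
            if not check(row.get(column)):
                issues.append((i, column, row.get(column)))
    return issues

rows = [
    {"preferred name": "Goethe, Johann Wolfgang von", "gnd id": "118540238"},
    {"preferred name": "", "gnd id": "not-an-id"},
]
print(validate(rows, gnd_authority_rules))
```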

We'd actually HELP the GLAM community by not doing much at all with MARC21 directly, because indeed it's dying, and should. The library community and vendors already primarily embrace MARCXML and JSON. (Terry Reese himself has already invested in ensuring MarcEdit can do metadata translations to JSON, JSON-LD, XML, BIBFRAME. And exporting as tab-delimited, and validation, and, and, and...)

The librarians I've helped use both programs, MarcEdit and OpenRefine, and they have not had a need to export MARC records from OpenRefine; it's always been JSON or CSV. They use MarcEdit for all the heavy lifting of getting OUT of MARC, validating, and exporting into their new formats. Their OpenRefine export format is JSON, and they use MarcEdit to convert JSON to MARC (or sometimes JSON to XML, depending on their target DB). Crystal Clements at UW.edu was one of the people doing workflows like this.

MarcEdit's main translation is MARCXML (where he heavily uses XSLT behind the scenes), but it has slowly evolved into using more JSON. The reason MARCXML is the main translation is that many ILS's (all now?), like Koha, Symphony, etc., use MARCXML as an import/export format for records.

Some don't use MarcEdit at all, and instead just use services like OCLC's WorldShare Record Manager, which imports/exports MARC21 and MARCXML and, incidentally, already has GND authority support built in as well.

But I'm out of the library world now, so I couldn't help much, but I can point to folks who can, like Antoine Isaac or perhaps Esther Chen at Max Planck.