XSLT 1.0 based templating/export

psm · August 27, 2023, 7:50pm

Dear all

We are in the process of generating a MARC 21 Authority format-based version of Homosaurus (https://homosaurus.org/, version 3.4) for our local use in an ILS.

It is possible to fetch datasets from Homosaurus in MARC format based on term ID, e.g., https://homosaurus.org/v3/homoit0001670.marc, but this produces a bare-bones MARC record that lacks even tag 450 (a 'see' reference based on skos:altLabel). So we decided to try something else.
The JSON-Extended format of Homosaurus is quite rich in data elements, e.g., https://homosaurus.org/v3/homoit0002606.json, and includes almost all the elements we need for a comprehensive authority file.
In MarcEdit, we are using an XSLT 1.0-based converter (developed based on https://github.com/reeset/marcedit_xslt_files/blob/master/homosaurus_xml.xsl) to take each JSON record and then convert it into MARC authority records. It is working nicely, and we added all required data elements to the modified XSLT file. But it takes one record at a time.
We fetched all JSON-Extended format-based records in OpenRefine from Homosaurus and are in the process of generating MARC-XML records by using GREL (which is slightly painful).

In this situation, I was wondering whether we could use that XSLT file in the templating facility of OpenRefine. Our need is something like this: it will read each record in JSON-Extended format and use that XSLT file to produce a MARC-XML record.

If it is not possible through templating, is there any other way to do it?

Regards

ostephens · September 3, 2023, 5:59pm

Hi @psm and sorry for the delay in responding. The OpenRefine templating export might offer a way to achieve what you need, but you'd have to build the template by hand, and the JSON isn't going to be that easy to handle from what I can see, which might make it no less painful than your current approach of creating it from GREL.

I wonder if this post from Terry Reese (who created and maintained the MarcEdit software) was a response to a question from you? It seems to describe exactly what you want to do, but possibly you've found it ineffective? The post is "How do I generate MARC authority records from the Homosaurus vocabulary?" on Terry's Worklog.

If I was going to approach this in OpenRefine, I suspect my approach would be to use the MARC XML representation provided by Homosaurus as the starting point, and then enhance it by adding the missing data from the JSON. That way I'd only have to write sufficient GREL to add data rather than having to do the full transformation that way. If this is of interest, let me know and I'll add some more details.

Hope this helps

Owen

psm · September 5, 2023, 8:40am

Thanks @ostephens for showing us the path.

I've gone through the post you referred to (the question was asked by someone else), which concerned the conversion of the JSON-LD format of Homosaurus. We adopted the same approach for the JSON-Extended format, as it is presently the richest Homosaurus format in terms of data elements (only links for BT/NT in JSON-LD vs. term+link for BT/NT in JSON-Extended). We modified the original XSLT to include some additional tags such as 001, 003, 005, 450, 750, etc. This gives us the desired results for each term, one at a time. Batch processing of all 5K+ terms is not possible in MarcEdit.
We attempted the approach you are suggesting in the following way: a) fetch MARC-XML based on the Homosaurus term ID; b) strip off the <record> </record> tags; c) fetch JSON-Extended to add the missing elements in MARC-XML format (column marcXml2 here); d) add <collection> <record> ... </record> </collection> tags through templating during export to a MARC-XML format compatible with MarcEdit; and e) add tags 001, 003, 005 (date and time stamp), etc. to each record in MarcEdit. Are we on the right track?
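
Step (d) above amounts to concatenating the per-row record fragments and wrapping them in a single root element. A minimal Python sketch of that wrapping, done outside OpenRefine, might look like the following. It assumes each fragment is a complete, namespace-free `<record>...</record>` string; the function name `wrap_in_collection` is hypothetical and not part of OpenRefine or MarcEdit.

```python
import xml.etree.ElementTree as ET

MARC_NS = "http://www.loc.gov/MARC21/slim"


def wrap_in_collection(record_fragments):
    """Wrap <record> fragments in one MARCXML <collection> document.

    Assumption: every fragment is a well-formed, namespace-free
    <record>...</record> string, e.g. one cell value per OpenRefine row.
    """
    body = "\n".join(fragment.strip() for fragment in record_fragments)
    document = (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        f'<collection xmlns="{MARC_NS}">\n'
        f"{body}\n"
        "</collection>\n"
    )
    # Parse once so that a malformed fragment fails here rather than
    # later, when the file is loaded into another tool.
    ET.fromstring(document.encode("utf-8"))
    return document


if __name__ == "__main__":
    fragments = [
        '<record><controlfield tag="001">homoit0002607</controlfield></record>',
        '<record><controlfield tag="001">homoit0002606</controlfield></record>',
    ]
    print(wrap_in_collection(fragments))
```

Declaring the MARC namespace once on `<collection>` means the unprefixed `<record>` children inherit it, so the fragments themselves do not need to be edited.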

Thanks and regards

Partha

ostephens · September 5, 2023, 1:02pm

You should be able to modify the XSLT to process multiple JSON records if you first create a JSON array of the Homosaurus records. You could build that array with OpenRefine or with other batch processing.
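
As one illustration of the "other batch processing" route, the array could be assembled with a few lines of Python. This is only a sketch: it assumes each record has already been downloaded as a separate JSON text, and the function name `combine_records` is hypothetical.

```python
import json


def combine_records(json_texts):
    """Parse each record's JSON text and return one JSON array document.

    Assumption: every item in json_texts is the full JSON text of a
    single Homosaurus record (one JSON object per term).
    """
    records = [json.loads(text) for text in json_texts]
    # ensure_ascii=False keeps any non-ASCII labels readable in the output.
    return json.dumps(records, ensure_ascii=False, indent=4)


if __name__ == "__main__":
    texts = [
        '{"dc:identifier": "homoit0002607"}',
        '{"dc:identifier": "homoit0002606"}',
    ]
    print(combine_records(texts))
```

Parsing each text before joining, rather than concatenating strings with commas, means a truncated or invalid download is reported immediately instead of producing a broken array.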

Terry Reese is probably in the best place to advise on modifying the xslt to handle multiple records, but my attempt, which seems to work, is below.

Example xslt:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
  xmlns:dc="http://purl.org/dc/elements/1.1/"
  xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
  xmlns:skos="http://www.w3.org/2004/02/skos/core#"
  xmlns="http://www.loc.gov/MARC21/slim"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="xml" indent="yes" />

  <xsl:template match="/">
    <collection xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:schemaLocation="http://www.loc.gov/MARC21/slim http://www.loc.gov/standards/marcxml/schema/MARC21slim.xsd">
      <xsl:apply-templates />
    </collection>
  </xsl:template>


  <xsl:template match="record">
    <record xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:schemaLocation=" http://www.loc.gov/MARC21/slim http://www.loc.gov/standards/marcxml/schema/MARC21slim.xsd">
      <leader>00596nz a2200217n 4500</leader>

      <xsl:variable name="mod_date">
        <xsl:choose>
          <xsl:when test="//modified">
            <xsl:variable name="holddata" select="//modified/value" />
            <xsl:value-of
              select="substring(translate($holddata,'-' ,'' ),3,6)" />
          </xsl:when>
          <xsl:otherwise>
            <xsl:text>210101</xsl:text>
          </xsl:otherwise>
        </xsl:choose>

      </xsl:variable>
    "
      <!-- Control field 008 needs processed-->
      <controlfield tag="008">
        <xsl:value-of select="$mod_date" />|||anznnbab||||||||||||||a|||||||d </controlfield>

      <xsl:if
        test="//identifier">
        <datafield tag="024" ind1="8" ind2=" ">
          <subfield code="a">
            <xsl:value-of select="//identifier" />
          </subfield>
          <subfield code="0">
            <xsl:value-of select="//id" />
          </subfield>
        </datafield>
      </xsl:if>

      <!--***************************************************************
          * To add your cataloging institution code to the 040
          * use the <subfield code="[yoursubfield]">[your data]</subfield>
          * template.
          * for example, modifying the below to use Ohio State would look like
          <datafield tag="040" ind1=" " ind2=" ">
             <subfield code="a">OSU</subfield>
             <subfield code="f">homosaurus</subfield>
             <subfield code="c">OSU</subfield>
          </datafield>
          *****************************************************************
      -->
      <datafield
        tag="040" ind1=" " ind2=" ">
        <subfield code="f">homosaurus</subfield>
      </datafield>


      <!--**************************************************************
          * At this point (4/3/2021), the vocabulary does not provide a 
          * conceptual definition between topical vocab elements versus
          * genre elements.  In MARC, this is an important distinction
          * however, to represent the vocabulary faithfully, unless this 
          * distinction is coded into the terms, applying a genre 
          * context would be inference and would arguably no longer
          * faithfully represent the vocabulary or intended use.
          * Until the vocabulary explains variations in Concept, 
          * all terms are treated as topics.
      -->
      <xsl:if test="//prefLabel">
        <datafield tag="150" ind1=" " ind2=" ">
          <subfield code="a">
            <xsl:value-of select="//prefLabel" />
          </subfield>
        </datafield>
      </xsl:if>

      <xsl:for-each
        select="//altLabel">
        <datafield tag="450" ind1=" " ind2=" ">
          <subfield code="a">
            <xsl:value-of select="." />
          </subfield>
        </datafield>
      </xsl:for-each>
      
      <xsl:for-each
        select="//hasTopConcept">
        <datafield tag="550" ind1=" " ind2=" ">
          <xsl:if test="./prefLabel">
            <subfield code="a">
              <xsl:value-of select="./prefLabel" />
            </subfield>
          </xsl:if>
          <subfield code="0">
            <xsl:value-of select="./id" />
          </subfield>
        </datafield>
      </xsl:for-each>

      <xsl:for-each
        select="//broader">
        <datafield tag="550" ind1=" " ind2=" ">
          <xsl:if test="./prefLabel">
            <subfield code="a">
              <xsl:value-of select="./prefLabel" />
            </subfield>
          </xsl:if>
          <subfield code="0">
            <xsl:value-of select="./id" />
          </subfield>
        </datafield>
      </xsl:for-each>

      <xsl:for-each
        select="//comment">
        <datafield tag="680" ind1=" " ind2=" ">
          <subfield code="a">
            <xsl:value-of select="." />
          </subfield>
        </datafield>
      </xsl:for-each>

    </record>
  </xsl:template>
</xsl:stylesheet>

Example of two Homosaurus JSON-LD records in an array:

[
    {
        "@context": {
            "skos": "http://www.w3.org/2004/02/skos/core#",
            "dc": "http://purl.org/dc/terms/",
            "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
            "xsd": "http://www.w3.org/2001/XMLSchema#"
        },
        "@id": "https://homosaurus.org/v3/homoit0002607",
        "dc:identifier": "homoit0002607",
        "skos:prefLabel": "Lesbian erotic art ",
        "rdfs:comment": "Erotic art, created by lesbian artists, depicting lesbian people, or that are regarded as important within lesbian culture.",
        "dc:issued": {
            "@type": "xsd:date",
            "@value": "2023-04-14"
        },
        "dc:modified": {
            "@type": "xsd:date",
            "@value": "2023-04-14"
        },
        "skos:exactMatch": {
            "@id": "http://id.loc.gov/authorities/subjects/sh96007206"
        },
        "@type": "skos:Concept",
        "skos:inScheme": {
            "@id": "https://homosaurus.org/v3"
        },
        "skos:broader": [
            [
                {
                    "@id": "https://homosaurus.org/v3/homoit0002579",
                    "skos:prefLabel": "LGBTQ+ erotic art"
                }
            ]
        ],
        "skos:hasTopConcept": [
            [
                {
                    "@id": "https://homosaurus.org/v3/homoit0002607",
                    "skos:prefLabel": "Lesbian erotic art "
                }
            ]
        ],
        "skos:narrower": [
            [
                {
                    "@id": "https://homosaurus.org/v3/homoit0002546",
                    "skos:prefLabel": "Lesbian erotic films"
                }
            ],
            [
                {
                    "@id": "https://homosaurus.org/v3/homoit0000765",
                    "skos:prefLabel": "Lesbian porn films"
                }
            ]
        ],
        "skos:related": [
            [
                {
                    "@id": "https://homosaurus.org/v3/homoit0002599",
                    "skos:prefLabel": "Lesbian art "
                }
            ],
            [
                {
                    "@id": "https://homosaurus.org/v3/homoit0002599",
                    "skos:prefLabel": "Lesbian art "
                }
            ]
        ]
    },
    {
        "@context": {
            "skos": "http://www.w3.org/2004/02/skos/core#",
            "dc": "http://purl.org/dc/terms/",
            "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
            "xsd": "http://www.w3.org/2001/XMLSchema#"
        },
        "@id": "https://homosaurus.org/v3/homoit0002606",
        "dc:identifier": "homoit0002606",
        "skos:prefLabel": "Two-Spirit erotic art ",
        "skos:altLabel": [
            "2S erotic art ",
            "Two Spirit erotic art ",
            "Two Spirited erotic art ",
            "Two-Spirited erotic art "
        ],
        "rdfs:comment": "Erotic art, created by Two-Spirit artists, depicting Two-Spirit people, or that are regarded as important within Two-Spirit culture.",
        "dc:issued": {
            "@type": "xsd:date",
            "@value": "2023-04-14"
        },
        "dc:modified": {
            "@type": "xsd:date",
            "@value": "2023-04-14"
        },
        "@type": "skos:Concept",
        "skos:inScheme": {
            "@id": "https://homosaurus.org/v3"
        },
        "skos:broader": [
            [
                {
                    "@id": "https://homosaurus.org/v3/homoit0002579",
                    "skos:prefLabel": "LGBTQ+ erotic art"
                }
            ]
        ],
        "skos:hasTopConcept": [
            [
                {
                    "@id": "https://homosaurus.org/v3/homoit0002606",
                    "skos:prefLabel": "Two-Spirit erotic art "
                }
            ]
        ],
        "skos:narrower": [
            [
                {
                    "@id": "https://homosaurus.org/v3/homoit0002555",
                    "skos:prefLabel": "Two-Spirit porn films"
                }
            ],
            [
                {
                    "@id": "https://homosaurus.org/v3/homoit0002545",
                    "skos:prefLabel": "Two-Spirit erotic films "
                }
            ]
        ],
        "skos:related": [
            [
                {
                    "@id": "https://homosaurus.org/v3/homoit0002597",
                    "skos:prefLabel": "Two-Spirit art "
                }
            ],
            [
                {
                    "@id": "https://homosaurus.org/v3/homoit0002597",
                    "skos:prefLabel": "Two-Spirit art "
                }
            ]
        ]
    }
]

psm · September 6, 2023, 12:00pm

Thanks @ostephens for this suggestion.

Batch processing now works, and MarcEdit correctly reports how many records it produces, but there seems to be a problem in the final output (mrc or mrcx). Every exported record gets the same 150 (I tested with your two records and with ten of mine), and tags 450, 550, 680, etc. are also identical across all exported records. Using your two records as input produced the following:

First record

=LDR  01030nz aa2000241 4500  
=008  230414|||anznnbab||||||||||||||a|||||||d\
=024  8\$ahomoit0002607$0https://homosaurus.org/v3/homoit0002607
=040  \\$fhomosaurus
=150  \\$aLesbian erotic art 
=450  \\$a2S erotic art 
=450  \\$aTwo Spirit erotic art 
=450  \\$aTwo Spirited erotic art 
=450  \\$aTwo-Spirited erotic art 
=550  \\$0
=550  \\$aLesbian erotic art $0https://homosaurus.org/v3/homoit0002607
=550  \\$0
=550  \\$aTwo-Spirit erotic art $0https://homosaurus.org/v3/homoit0002606
=550  \\$0
=550  \\$aLGBTQ+ erotic art$0https://homosaurus.org/v3/homoit0002579
=550  \\$0
=550  \\$aLGBTQ+ erotic art$0https://homosaurus.org/v3/homoit0002579
=680  \\$aErotic art, created by lesbian artists, depicting lesbian people, or that are regarded as important within lesbian culture.
=680  \\$aErotic art, created by Two-Spirit artists, depicting Two-Spirit people, or that are regarded as important within Two-Spirit culture.

Second record

=LDR  01030nz aa2000241 4500  
=008  230414|||anznnbab||||||||||||||a|||||||d\
=024  8\$ahomoit0002607$0https://homosaurus.org/v3/homoit0002607
=040  \\$fhomosaurus
=150  \\$aLesbian erotic art 
=450  \\$a2S erotic art 
=450  \\$aTwo Spirit erotic art 
=450  \\$aTwo Spirited erotic art 
=450  \\$aTwo-Spirited erotic art 
=550  \\$0
=550  \\$aLesbian erotic art $0https://homosaurus.org/v3/homoit0002607
=550  \\$0
=550  \\$aTwo-Spirit erotic art $0https://homosaurus.org/v3/homoit0002606
=550  \\$0
=550  \\$aLGBTQ+ erotic art$0https://homosaurus.org/v3/homoit0002579
=550  \\$0
=550  \\$aLGBTQ+ erotic art$0https://homosaurus.org/v3/homoit0002579
=680  \\$aErotic art, created by lesbian artists, depicting lesbian people, or that are regarded as important within lesbian culture.
=680  \\$aErotic art, created by Two-Spirit artists, depicting Two-Spirit people, or that are regarded as important within Two-Spirit culture.

By the way, there is a stray ' " ' before the start of control field 008:

</xsl:variable>
    "
      <!-- Control field 008 needs processed-->

However, the result is the same with or without it.

What am I missing here?

Regards

Partha

ostephens · September 6, 2023, 1:34pm

@psm I'm pretty sure the problem is that my XSLT isn't quite correct: I think the paths in the various "select" statements need tweaking. This question would be better addressed either directly to Terry Reese of MarcEdit or on the MarcEdit email list http://listserv.gmu.edu/cgi-bin/wa?A0=marcedit-l, as that community is in a better position to help with this specific question.
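
One likely reason for the identical output is XPath scoping. In XPath 1.0, a path that starts with `//` is evaluated from the document root regardless of the current node, so inside the `record` template a select such as `//altLabel` collects matches from every record, and `xsl:value-of select="//prefLabel"` returns the first prefLabel in the whole document. Paths relative to the current record (for example `prefLabel` or `altLabel`) would probably avoid this, though that has not been tested in MarcEdit. The Python sketch below only illustrates the scoping difference with the standard library; the sample XML is a hypothetical stand-in for whatever intermediate XML MarcEdit builds from the JSON array.

```python
import xml.etree.ElementTree as ET

# Hypothetical intermediate XML: the real structure MarcEdit builds from
# the JSON array may differ; only the element names used in the
# stylesheet (record, prefLabel, altLabel) are taken from the thread.
SAMPLE = """
<root>
  <record>
    <prefLabel>Lesbian erotic art</prefLabel>
  </record>
  <record>
    <prefLabel>Two-Spirit erotic art</prefLabel>
    <altLabel>2S erotic art</altLabel>
    <altLabel>Two Spirit erotic art</altLabel>
  </record>
</root>
"""


def labels_document_wide(root):
    """Mimic select="//altLabel": search the whole document for each record."""
    return [
        [element.text for element in root.iter("altLabel")]
        for _ in root.findall("record")
    ]


def labels_per_record(root):
    """Mimic select="altLabel": only look at children of the current record."""
    return [
        [element.text for element in record.findall("altLabel")]
        for record in root.findall("record")
    ]


if __name__ == "__main__":
    tree_root = ET.fromstring(SAMPLE)
    print(labels_document_wide(tree_root))  # same list repeated per record
    print(labels_per_record(tree_root))     # each record keeps its own labels
```

With the document-wide search, both records receive the second record's alternative labels, which matches the repeated 450 fields reported above.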

If you want more input on the OpenRefine side of things, then I think what you describe above suggests you are on the right path if you want to do it that way (although my guess is that the MarcEdit XSLT route will ultimately be easier and more flexible in this case).

Owen

psm · September 6, 2023, 3:58pm

Thanks, @ostephens; I will catch up on the list you advised.

Heartfelt regards
