HTML parsing (MARC data)

I have a dataset that I obtained by web-scraping a library catalogue database. The structure is like this:

<body ID="opac-showmarc" class="branch-default" >

<div id="main">
<pre>LDR 01090n   a2200277 a 4500
000     BRB                 
001     454713
005     20230303095821.0
020    _a
       _cरु. 60.00
       _6880-00/(B
040    _aCRL, Kolkata
041    _amar
082    _a891.462
       _b
084    _aO155,2
100    _aकोटस्थाने, रमेश
       _b
       _c
       _d
       _6880-01/(B
245    _aरंग माझा वेगळा - मंत्र सुखाचा
       _b
       _c
       _6880-02/(B
260    _aपुणे
       _bनीहारा प्रकाशन
       _cएप्रिल 2021
       _6880-03/(B
300    _a48 पृ.
       _bचित्रे
       _c22 सें. मी.
       _e
       _6880-04/(B
500    _aदोन बालएकांकिका.
       _6880-05/(B
880    _cRu. 60.00
       _6020-00/(B
880    _aKoṭasthāne, Rameś
       _6100-01/(B
880    _aRaṅg mājhā vegaḷā - mantra sukhācā
       _6245-02/(B
880    _aPuṇe
       _bNīhārā Prakāśan
       _cEpril 2021
       _6260-03/(B
880    _a48 pṛ.
       _bcitre
       _c22 seṁ. mī.
       _6300-04/(B
880    _aDon bālekāṅkikā.
       _6500-05/(B
965    _aDB 58816
999    _aMR21590
       _c348001
       _d348001</pre>
</div>
</body>
</html>

A GREL expression like value.parseHtml().select("pre")[0].htmlText() gives me this output:


LDR 01090n   a2200277 a 4500
000     BRB                 
001     454713
005     20230303095821.0
020    _a
       _cरु. 60.00
       _6880-00/(B
040    _aCRL, Kolkata
041    _amar
082    _a891.462
       _b
084    _aO155,2
100    _aकोटस्थाने, रमेश
       _b
       _c
       _d
       _6880-01/(B
245    _aरंग माझा वेगळा - मंत्र सुखाचा
       _b
       _c
       _6880-02/(B
260    _aपुणे
       _bनीहारा प्रकाशन
       _cएप्रिल 2021
       _6880-03/(B

How can I extract the data value from a whole tag such as 082, or from a given subfield under it (082 _a)?

Regards

In this case the issue isn’t the HTML (you’ve already successfully parsed it and extracted the relevant content) but manipulating the MARC in the format in which it has been embedded in the webpage.

Unfortunately this isn’t straightforward, because the MARC structure as represented here is tricky to work with. Assuming you have multiple MARC records extracted this way, you might be best off exporting just the column that contains the MARC data. From there you can do more manipulation in a text editor, use a specialist tool like MarcEdit, or re-import into OpenRefine to work with the data there.
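If you do export that column, a short script is one possible way to pull out a tag or subfield. The following is only a minimal Python sketch, assuming the text keeps the layout shown in the question (a three-character tag at the start of a line, subfields introduced by an underscore plus a one-character code, and continuation lines indented). The function names parse_marc_text and get_values are hypothetical, not part of any MARC library.

```python
import re

# A field starts with "LDR" or a three-digit tag at the very start of a line.
FIELD_START = re.compile(r"^(LDR|\d{3})\s+(.*)$")
# A subfield is written as "_" + one-character code + value, e.g. "_a891.462".
SUBFIELD = re.compile(r"^_(\w)(.*)$")

def parse_marc_text(text):
    """Return a list of (tag, [(code, value), ...]) from the line-based dump.

    Control fields (no subfield marker) get the code None.
    """
    fields = []
    for line in text.splitlines():
        if not line.strip():
            continue
        start = FIELD_START.match(line)
        if start:
            tag, rest = start.group(1), start.group(2).strip()
            fields.append((tag, []))
        elif fields:
            rest = line.strip()  # indented continuation of the previous field
        else:
            continue  # stray text before the first field
        sub = SUBFIELD.match(rest)
        if sub:
            fields[-1][1].append((sub.group(1), sub.group(2).strip()))
        else:
            fields[-1][1].append((None, rest))
    return fields

def get_values(fields, tag, code=None):
    """Non-empty values of a tag, optionally limited to one subfield code."""
    return [
        value
        for field_tag, subfields in fields
        if field_tag == tag
        for sub_code, value in subfields
        if (code is None or sub_code == code) and value
    ]

sample = """082    _a891.462
       _b
084    _aO155,2
"""
print(get_values(parse_marc_text(sample), "082", "a"))  # ['891.462']
```

Empty subfields such as the bare _b under 082 are skipped, so asking for the whole 082 field returns only the values that are actually present.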

I posted a series on working with MARC data which includes using it in OpenRefine at fixmarc | Overdue Ideas. Your scenario is slightly different to the one I was dealing with in this series but it may give you some idea of how to proceed.

If you want to share some more detail (e.g. an example of where you are extracting the data from and what you need at the end) I might be able to offer more detailed advice.

Dear Owen

Sorry for the late response. I was trying the approaches you advised.

I’m actually exploring the following: 1) scrape the data from a library catalogue (details given below); 2) use that dataset to extract a set of selected MARC tags (with content); and 3) prepare a text corpus containing the title (245 tag) and summary notes (5xx tags) in one column and the subject indexing terms (650 tag) in another column.

The source site is here (unfortunately the OAI/PMH based metadata harvesting is not working for this union catalogue) - https://librarycatalogue.nvli.in/.

We can extract from here in the following ways -

type 1
https://librarycatalogue.nvli.in/cgi-bin/koha/opac-MARCdetail.pl?biblionumber=2402943

type 2 (html)
https://librarycatalogue.nvli.in/cgi-bin/koha/opac-showmarc.pl?id=18687&viewas=html

type 3 (xml)
https://librarycatalogue.nvli.in/cgi-bin/koha/opac-showmarc.pl?id=18687&viewas=xml

Biblio numbers are sequential (id=) from 1 to 345,0500. In my earlier message I produced the results using the type 3 fetch.
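Since the ids are sequential, the list of URLs to fetch can be generated rather than typed. This is only a small Python sketch built on the opac-showmarc.pl URL pattern shown above; the function names are hypothetical and the id range in the example is purely illustrative.

```python
from urllib.parse import urlencode

# URL pattern taken from the type 2 / type 3 examples above.
BASE = "https://librarycatalogue.nvli.in/cgi-bin/koha/opac-showmarc.pl"

def showmarc_url(record_id, viewas="html"):
    """Build the showmarc URL for one record id ("html" or "xml" view)."""
    return BASE + "?" + urlencode({"id": record_id, "viewas": viewas})

def showmarc_urls(first_id, last_id, viewas="html"):
    """Yield URLs for every id in the inclusive range first_id..last_id."""
    for record_id in range(first_id, last_id + 1):
        yield showmarc_url(record_id, viewas)

# Illustrative range only; fetch politely and in small batches.
for url in showmarc_urls(18687, 18689):
    print(url)
```

The resulting list could be pasted into OpenRefine as a column and fetched with "Add column by fetching URLs".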

The type 2 fetch gives me the following results (example):

<html xmlns:marc="http://www.loc.gov/MARC21/slim">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title>MARC View</title>
</head>
<body>

<table>
<tr>
<th style="white-space:nowrap">
					000
				</th>
<td colspan="2"></td>
<td>01320nam a2200325 i 4500</td>
</tr>
<tr>
<th style="white-space:nowrap">001</th>
<td colspan="2"></td>
<td>vtls001911996</td>
</tr>
<tr>
<th style="white-space:nowrap">003</th>
<td colspan="2"></td>
<td>NLI</td>
</tr>
<tr>
<th style="white-space:nowrap">005</th>
<td colspan="2"></td>
<td>20210807000205.0</td>
</tr>
<tr>
<th style="white-space:nowrap">007</th>
<td colspan="2"></td>
<td>cr cn|||||||||</td>
</tr>
<tr>
<th style="white-space:nowrap">008</th>
<td colspan="2"></td>
<td>160914t20142014ctuab   s     001 0 eng d</td>
</tr>
<tr>
<th style="white-space:nowrap">020</th>
<td> </td>
<td> </td>
<td>
<strong>_a</strong>9780300180312 </td>
</tr>
<tr>
<th style="white-space:nowrap">020</th>
<td> </td>
<td> </td>
<td>
<strong>_a</strong>9780300206197 (e-book) </td>
</tr>
<tr>
<th style="white-space:nowrap">039</th>
<td> </td>
<td>9</td>
<td>
<strong>_a</strong>201610171143<br><strong>_b</strong>gopag<br><strong>_y</strong>201609141656<br><strong>_z</strong>gopag </td>
</tr>
<tr>
<th style="white-space:nowrap">044</th>
<td> </td>
<td> </td>
<td>
<strong>_a</strong>ctu </td>
</tr>
<tr>
<th style="white-space:nowrap">100</th>
<td>1</td>
<td> </td>
<td>
<strong>_a</strong>Laband, John,<br><strong>_d</strong>1947-<br><strong>_9</strong>1270208 </td>
</tr>
<tr>
<th style="white-space:nowrap">245</th>
<td>1</td>
<td>0</td>
<td>
<strong>_a</strong>Zulu warriors :<br><strong>_b</strong>the battle for the South African frontier /<br><strong>_c</strong>John Laband. </td>
</tr>
<tr>
<th style="white-space:nowrap">260</th>
<td> </td>
<td> </td>
<td>
<strong>_a</strong>New Haven, Connecticut :<br><strong>_b</strong>Yale University Press,<br><strong>_c</strong>2014. </td>
</tr>
<tr>
<th style="white-space:nowrap">300</th>
<td> </td>
<td> </td>
<td>
<strong>_a</strong>1 online resource (390 p.) :<br><strong>_b</strong>ill., maps </td>
</tr>
<tr>
<th style="white-space:nowrap">500</th>
<td> </td>
<td> </td>
<td>
<strong>_a</strong>Includes index. </td>
</tr>
<tr>
<th style="white-space:nowrap">650</th>
<td> </td>
<td>0</td>
<td>
<strong>_a</strong>Zulu War, 1879.<br><strong>_9</strong>1270209 </td>
</tr>
<tr>
<th style="white-space:nowrap">650</th>
<td> </td>
<td>0</td>
<td>
<strong>_a</strong>Zulu (African people)<br><strong>_x</strong>History<br><strong>_y</strong>19th century.<br><strong>_9</strong>1270210 </td>
</tr>
<tr>
<th style="white-space:nowrap">650</th>
<td> </td>
<td>0</td>
<td>
<strong>_a</strong>Sociology, Military<br><strong>_z</strong>South Africa<br><strong>_z</strong>Zululand.<br><strong>_9</strong>1270211 </td>
</tr>
<tr>
<th style="white-space:nowrap">651</th>
<td> </td>
<td>0</td>
<td>
<strong>_a</strong>Zululand (South Africa)<br><strong>_x</strong>History, Military<br><strong>_y</strong>19th century.<br><strong>_9</strong>1270212 </td>
</tr>
<tr>
<th style="white-space:nowrap">856</th>
<td>4</td>
<td>0</td>
<td>
<strong>_u</strong>http://site.ebrary.com/lib/nationallibgovin/Doc?id=10856661<br><strong>_z</strong>An electronic book accessible through the World Wide Web; click to view </td>
</tr>
<tr>
<th style="white-space:nowrap">887</th>
<td> </td>
<td> </td>
<td>
<strong>_a</strong> Gopa  </td>
</tr>
<tr>
<th style="white-space:nowrap">905</th>
<td> </td>
<td> </td>
<td>
<strong>_a</strong>Gopa </td>
</tr>
<tr>
<th style="white-space:nowrap">949</th>
<td> </td>
<td> </td>
<td>
<strong>_A</strong>VIRTUAITEM<br><strong>_D</strong>10000<br><strong>_X</strong>206<br><strong>_6</strong>EBK000017588ENG<br><strong>_e</strong>EBK17588 </td>
</tr>
<tr>
<th style="white-space:nowrap">942</th>
<td> </td>
<td> </td>
<td>
<strong>_2</strong>ddc<br><strong>_c</strong>BKS </td>
</tr>
<tr>
<th style="white-space:nowrap">999</th>
<td> </td>
<td> </td>
<td>
<strong>_a</strong>VIRTUA<br><strong>_c</strong>2402943<br><strong>_d</strong>2402943 </td>
</tr>
<tr>
<th style="white-space:nowrap">999</th>
<td> </td>
<td> </td>
<td>
<strong>_a</strong>VTLSSORT0070*0080*0200*0201*0440*1000*2450*2600*3000*5000*6500*6501*6502*6510*8560*9050*9992 </td>
</tr>
</table>

</body>
</html>

The GREL expression below gives me a way to produce data for export from OpenRefine into MarcEdit (RecordNumber | Tag | Indicators | Content), but in many cases the result is not consistent, for obvious reasons (for example, when tag 100 is absent, the tag 245 content ends up in the tag 100 column).


forEach(value.parseHtml().select("tr"),e,e.htmlText()).join("@@")

Result


000 01320nam a2200325 i 4500@@001 vtls001911996@@003 NLI@@005 20210807000205.0@@007 cr cn|||||||||@@008 160914t20142014ctuab s 001 0 eng d@@020 _a9780300180312@@020 _a9780300206197 (e-book)@@039 9 _a201610171143 _bgopag _y201609141656 _zgopag@@044 _actu@@100 1 _aLaband, John, _d1947- _91270208@@245 1 0 _aZulu warriors : _bthe battle for the South African frontier / _cJohn Laband.@@260 _aNew Haven, Connecticut : _bYale University Press, _c2014.@@300 _a1 online resource (390 p.) : _bill., maps@@500 _aIncludes index.@@650 0 _aZulu War, 1879. _91270209@@650 0 _aZulu (African people) _xHistory _y19th century. _91270210@@650 0 _aSociology, Military _zSouth Africa _zZululand. _91270211@@651 0 _aZululand (South Africa) _xHistory, Military _y19th century. _91270212@@856 4 0 _uhttp://site.ebrary.com/lib/nationallibgovin/Doc?id=10856661 _zAn electronic book accessible through the World Wide Web; click to view@@887 _a Gopa@@905 _aGopa@@949 _AVIRTUAITEM _D10000 _X206 _6EBK000017588ENG _eEBK17588@@942 _2ddc _cBKS@@999 _aVIRTUA _c2402943 _d2402943@@999 _aVTLSSORT0070*0080*0200*0201*0440*1000*2450*2600*3000*5000*6500*6501*6502*6510*8560*9050*9992
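One way around the shifting columns might be to keep one row per field, keyed by the tag itself rather than by position. Below is a minimal Python sketch using only the standard library's html.parser, assuming the table layout shown above (a th holding the tag, then either one colspan cell or two indicator cells, then the content cell). MarcTableParser and marc_rows are hypothetical names.

```python
from html.parser import HTMLParser

class MarcTableParser(HTMLParser):
    """Collect the text of every th/td cell, one list of cells per tr."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.rows = []      # finished rows
        self._cells = None  # cells of the row being read
        self._buf = None    # text pieces of the cell being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._cells = []
        elif tag in ("th", "td") and self._cells is not None:
            self._buf = []
        elif tag == "br" and self._buf is not None:
            self._buf.append(" ")  # subfields are separated by <br>

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._buf is not None:
            self._cells.append("".join(self._buf).strip())
            self._buf = None
        elif tag == "tr" and self._cells is not None:
            self.rows.append(self._cells)
            self._cells = None

    def handle_data(self, data):
        if self._buf is not None:
            self._buf.append(data)

def marc_rows(html):
    """Return (tag, ind1, ind2, content) tuples, one per table row."""
    parser = MarcTableParser()
    parser.feed(html)
    parser.close()
    rows = []
    for cells in parser.rows:
        if len(cells) == 4:    # data field: tag, ind1, ind2, content
            rows.append(tuple(cells))
        elif len(cells) == 3:  # control field: tag, colspan cell, content
            rows.append((cells[0], "", "", cells[2]))
    return rows
```

Because each tuple carries its own tag, a record without a 100 field simply has no 100 row, and the 245 content stays attached to 245.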

I’m still searching for a way out.

Best regards

Parthasarathi Mukhopadhyay

Hi @psm Most Koha installations have the OpenAPI endpoint available. You can make a RESTful call to the Koha OpenAPI endpoint, even if the OAI interface is disabled.

https://librarycatalogue.nvli.in/api/v1/

Example:
https://librarycatalogue.nvli.in/api/v1/items/2402943

I won’t go into details of how to search and use an OpenAPI endpoint, but with a bit of Jython or GREL you should be able to explore and parse the data more easily, since the output will be JSON.

Koha REST API (koha-community.org)

The public API for this union catalogue is not open (either they are using a very old version of Koha or it is closed). So I tried web scraping and finally succeeded (Owen’s tutorial helped us a lot - fixmarc | Overdue Ideas). The dataset now looks like this in OpenRefine (version 3.7.2):

Now suppose I want to export all the records in a given language. I can use Facet > Text facet on the Tags column to select 041 (the MARC tag for language) and then apply a Text filter for $aben (ben is the code for the Bengali language) to retrieve the 041 rows with the value $aben.

In this situation the exported file will include only the tag 041 data, like this:

How can I export complete records whose 041 tag has the value $aben? For example, RecordNumber 29 (image 1) has the 041 value $aben, so all the rows with RecordNumber 29 should be exported.

Regards

The trick here is to use the OpenRefine Records mode. I think something like the following should work:

  1. In the "RecordNumber" column use Edit cells -> Blank down - this groups all the properties/fields that belong to a single MARC record together
  2. Select "Records" at the top left of the data grid
  3. Create a custom text facet like:
    and(cells["Tags"].value=="041",cells["Content"].value=="$aben")

That should give true for all records that contain an 041 field with the value $aben. You need to do this as a single custom text facet rather than as two separate facets/filters, so that within the record the condition applies to a single row of data. That way it is the 041 field specifically that has the value, rather than the value just appearing somewhere in the record.
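For anyone who wants to see the logic outside OpenRefine, the same idea can be sketched in a few lines of Python. This is only an illustration of the record-level test, assuming rows of (RecordNumber, Tag, Content) with the record number filled in on every row; the function name rows_of_matching_records is hypothetical.

```python
def rows_of_matching_records(rows, tag, content):
    """Keep every row of any record that has ONE row with this tag and content."""
    # The tag and the content are tested on the same row, like the and(...) facet.
    matching = {rec for rec, row_tag, row_content in rows
                if row_tag == tag and row_content == content}
    return [row for row in rows if row[0] in matching]

rows = [
    ("28", "041", "$aeng"),
    ("28", "245", "$aTitle A"),
    ("29", "041", "$aben"),
    ("29", "245", "$aTitle B"),
    ("30", "245", "$aben"),  # value present, but not in an 041 field
]
for row in rows_of_matching_records(rows, "041", "$aben"):
    print(row)
```

Only the two rows of record 29 are printed. Record 30 is left out because its $aben sits in a 245 row, which is exactly the case that two separate facets would get wrong.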

Thanks @ostephens. Your solution rocks, as usual. :)

Heartfelt regards.
