HTML parsing (MARC data)

I have a dataset produced by web-scraping a library catalogue database. The structure is like this:

<body ID="opac-showmarc" class="branch-default" >

<div id="main">
<pre>LDR 01090n   a2200277 a 4500
000     BRB                 
001     454713
005     20230303095821.0
020    _a
       _cरु. 60.00
       _6880-00/(B
040    _aCRL, Kolkata
041    _amar
082    _a891.462
       _b
084    _aO155,2
100    _aकोटस्थाने, रमेश
       _b
       _c
       _d
       _6880-01/(B
245    _aरंग माझा वेगळा - मंत्र सुखाचा
       _b
       _c
       _6880-02/(B
260    _aपुणे
       _bनीहारा प्रकाशन
       _cएप्रिल 2021
       _6880-03/(B
300    _a48 पृ.
       _bचित्रे
       _c22 सें. मी.
       _e
       _6880-04/(B
500    _aदोन बालएकांकिका.
       _6880-05/(B
880    _cRu. 60.00
       _6020-00/(B
880    _aKoṭasthāne, Rameś
       _6100-01/(B
880    _aRaṅg mājhā vegaḷā - mantra sukhācā
       _6245-02/(B
880    _aPuṇe
       _bNīhārā Prakāśan
       _cEpril 2021
       _6260-03/(B
880    _a48 pṛ.
       _bcitre
       _c22 seṁ. mī.
       _6300-04/(B
880    _aDon bālekāṅkikā.
       _6500-05/(B
965    _aDB 58816
999    _aMR21590
       _c348001
       _d348001</pre>
</div>
</body>
</html>

A GREL expression like value.parseHtml().select("pre")[0].htmlText() gives me this output:


LDR 01090n   a2200277 a 4500
000     BRB                 
001     454713
005     20230303095821.0
020    _a
       _cरु. 60.00
       _6880-00/(B
040    _aCRL, Kolkata
041    _amar
082    _a891.462
       _b
084    _aO155,2
100    _aकोटस्थाने, रमेश
       _b
       _c
       _d
       _6880-01/(B
245    _aरंग माझा वेगळा - मंत्र सुखाचा
       _b
       _c
       _6880-02/(B
260    _aपुणे
       _bनीहारा प्रकाशन
       _cएप्रिल 2021
       _6880-03/(B

How can I extract the value of a whole tag such as 082, or of a given subfield under it (082 _a)?

Regards

In this case the issue isn’t the HTML (you’ve already successfully parsed it and extracted the relevant content) but manipulating the MARC in the format that’s been embedded in the webpage.

Unfortunately this isn’t straightforward, because the MARC structure as represented here is tricky to work with. Assuming you have multiple MARC records extracted this way, you might be best off exporting just the column that contains the MARC data. From there you can do more manipulation in a text editor, use a specialist tool like MarcEdit, or re-import into OpenRefine to work with the data there.
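
As a rough illustration of the "manipulate it outside OpenRefine" route, here is a minimal sketch in standalone Python 3 (OpenRefine's built-in Jython is Python 2.7, so it would need adapting to run there). It assumes the exported text keeps the layout shown in the question: a three-character tag at the start of a line, subfields written as `_` plus a one-character code, and indented continuation lines for further subfields. The names `parse_marc_text` and `get_values` are invented for this example.

```python
import re

# Tag at the start of the line, optionally followed by content.
FIELD_LINE = re.compile(r"^(\w{3})(?:\s+(.*))?$")
# Subfield marker as it appears in this display: "_" + one-character code.
SUBFIELD = re.compile(r"^_(\w)(.*)$")


def parse_marc_text(text):
    """Return [(tag, [(code, value), ...]), ...] in document order.

    Control fields (no subfield marker) are stored with code None.
    """
    fields = []
    for raw in text.splitlines():
        line = raw.rstrip()
        if not line:
            continue
        if line[0].isspace():
            # Indented line: another subfield of the previous tag.
            if not fields:
                continue
            body = line.strip()
        else:
            m = FIELD_LINE.match(line)
            if not m:
                continue
            fields.append((m.group(1), []))
            body = (m.group(2) or "").strip()
        if not body:
            continue
        sub = SUBFIELD.match(body)
        if sub:
            fields[-1][1].append((sub.group(1), sub.group(2).strip()))
        else:
            fields[-1][1].append((None, body))
    return fields


def get_values(fields, tag, code=None):
    """Non-empty values for a tag, optionally limited to one subfield code."""
    return [
        value
        for t, subs in fields
        if t == tag
        for c, value in subs
        if value and (code is None or c == code)
    ]


# Lines copied from the record in the question.
sample = """082    _a891.462
       _b
084    _aO155,2
260    _aपुणे
       _bनीहारा प्रकाशन
       _cएप्रिल 2021
"""

fields = parse_marc_text(sample)
print(get_values(fields, "082", "a"))  # ['891.462']
print(get_values(fields, "260"))       # all non-empty subfields of 260
```

Empty subfields such as the bare `_b` under 082 are kept by the parser but skipped by `get_values`, so a "full tag" lookup returns only the subfields that actually carry data.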

I posted a series on working with MARC data which includes using it in OpenRefine at fixmarc | Overdue Ideas. Your scenario is slightly different to the one I was dealing with in this series but it may give you some idea of how to proceed.

If you want to share some more detail (e.g. an example of where you are extracting the data from and what you need at the end), I might be able to offer more detailed advice.

Dear Owen

Sorry for the late response. I was trying the paths you advised.

I’m actually exploring the following: 1) scrape the data from a library catalogue (details given below); 2) use that dataset to extract a set of selected MARC tags (with content); and 3) prepare a text corpus containing the title (245 tag) and summary notes (5xx tags) in one column and subject indexing terms (650 tag) in another column.

The source site is here (unfortunately OAI-PMH metadata harvesting is not working for this union catalogue): https://librarycatalogue.nvli.in/.

We can extract data from it in the following ways:

type 1
https://librarycatalogue.nvli.in/cgi-bin/koha/opac-MARCdetail.pl?biblionumber=2402943

type 2 (html)
https://librarycatalogue.nvli.in/cgi-bin/koha/opac-showmarc.pl?id=18687&viewas=html

type 3 (xml)
https://librarycatalogue.nvli.in/cgi-bin/koha/opac-showmarc.pl?id=18687&viewas=xml

Biblio numbers (id=) are sequential, from 1 to 345,0500. In my earlier post I produced the results by following the type 3 data fetching.

The type 2 data fetching gives me the following result (example):

<html xmlns:marc="http://www.loc.gov/MARC21/slim">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title>MARC View</title>
</head>
<body>

<table>
<tr>
<th style="white-space:nowrap">
					000
				</th>
<td colspan="2"></td>
<td>01320nam a2200325 i 4500</td>
</tr>
<tr>
<th style="white-space:nowrap">001</th>
<td colspan="2"></td>
<td>vtls001911996</td>
</tr>
<tr>
<th style="white-space:nowrap">003</th>
<td colspan="2"></td>
<td>NLI</td>
</tr>
<tr>
<th style="white-space:nowrap">005</th>
<td colspan="2"></td>
<td>20210807000205.0</td>
</tr>
<tr>
<th style="white-space:nowrap">007</th>
<td colspan="2"></td>
<td>cr cn|||||||||</td>
</tr>
<tr>
<th style="white-space:nowrap">008</th>
<td colspan="2"></td>
<td>160914t20142014ctuab   s     001 0 eng d</td>
</tr>
<tr>
<th style="white-space:nowrap">020</th>
<td> </td>
<td> </td>
<td>
<strong>_a</strong>9780300180312 </td>
</tr>
<tr>
<th style="white-space:nowrap">020</th>
<td> </td>
<td> </td>
<td>
<strong>_a</strong>9780300206197 (e-book) </td>
</tr>
<tr>
<th style="white-space:nowrap">039</th>
<td> </td>
<td>9</td>
<td>
<strong>_a</strong>201610171143<br><strong>_b</strong>gopag<br><strong>_y</strong>201609141656<br><strong>_z</strong>gopag </td>
</tr>
<tr>
<th style="white-space:nowrap">044</th>
<td> </td>
<td> </td>
<td>
<strong>_a</strong>ctu </td>
</tr>
<tr>
<th style="white-space:nowrap">100</th>
<td>1</td>
<td> </td>
<td>
<strong>_a</strong>Laband, John,<br><strong>_d</strong>1947-<br><strong>_9</strong>1270208 </td>
</tr>
<tr>
<th style="white-space:nowrap">245</th>
<td>1</td>
<td>0</td>
<td>
<strong>_a</strong>Zulu warriors :<br><strong>_b</strong>the battle for the South African frontier /<br><strong>_c</strong>John Laband. </td>
</tr>
<tr>
<th style="white-space:nowrap">260</th>
<td> </td>
<td> </td>
<td>
<strong>_a</strong>New Haven, Connecticut :<br><strong>_b</strong>Yale University Press,<br><strong>_c</strong>2014. </td>
</tr>
<tr>
<th style="white-space:nowrap">300</th>
<td> </td>
<td> </td>
<td>
<strong>_a</strong>1 online resource (390 p.) :<br><strong>_b</strong>ill., maps </td>
</tr>
<tr>
<th style="white-space:nowrap">500</th>
<td> </td>
<td> </td>
<td>
<strong>_a</strong>Includes index. </td>
</tr>
<tr>
<th style="white-space:nowrap">650</th>
<td> </td>
<td>0</td>
<td>
<strong>_a</strong>Zulu War, 1879.<br><strong>_9</strong>1270209 </td>
</tr>
<tr>
<th style="white-space:nowrap">650</th>
<td> </td>
<td>0</td>
<td>
<strong>_a</strong>Zulu (African people)<br><strong>_x</strong>History<br><strong>_y</strong>19th century.<br><strong>_9</strong>1270210 </td>
</tr>
<tr>
<th style="white-space:nowrap">650</th>
<td> </td>
<td>0</td>
<td>
<strong>_a</strong>Sociology, Military<br><strong>_z</strong>South Africa<br><strong>_z</strong>Zululand.<br><strong>_9</strong>1270211 </td>
</tr>
<tr>
<th style="white-space:nowrap">651</th>
<td> </td>
<td>0</td>
<td>
<strong>_a</strong>Zululand (South Africa)<br><strong>_x</strong>History, Military<br><strong>_y</strong>19th century.<br><strong>_9</strong>1270212 </td>
</tr>
<tr>
<th style="white-space:nowrap">856</th>
<td>4</td>
<td>0</td>
<td>
<strong>_u</strong>http://site.ebrary.com/lib/nationallibgovin/Doc?id=10856661<br><strong>_z</strong>An electronic book accessible through the World Wide Web; click to view </td>
</tr>
<tr>
<th style="white-space:nowrap">887</th>
<td> </td>
<td> </td>
<td>
<strong>_a</strong> Gopa  </td>
</tr>
<tr>
<th style="white-space:nowrap">905</th>
<td> </td>
<td> </td>
<td>
<strong>_a</strong>Gopa </td>
</tr>
<tr>
<th style="white-space:nowrap">949</th>
<td> </td>
<td> </td>
<td>
<strong>_A</strong>VIRTUAITEM<br><strong>_D</strong>10000<br><strong>_X</strong>206<br><strong>_6</strong>EBK000017588ENG<br><strong>_e</strong>EBK17588 </td>
</tr>
<tr>
<th style="white-space:nowrap">942</th>
<td> </td>
<td> </td>
<td>
<strong>_2</strong>ddc<br><strong>_c</strong>BKS </td>
</tr>
<tr>
<th style="white-space:nowrap">999</th>
<td> </td>
<td> </td>
<td>
<strong>_a</strong>VIRTUA<br><strong>_c</strong>2402943<br><strong>_d</strong>2402943 </td>
</tr>
<tr>
<th style="white-space:nowrap">999</th>
<td> </td>
<td> </td>
<td>
<strong>_a</strong>VTLSSORT0070*0080*0200*0201*0440*1000*2450*2600*3000*5000*6500*6501*6502*6510*8560*9050*9992 </td>
</tr>
</table>

</body>
</html>

A GREL expression like the one below finally gives me data I can export from OpenRefine into MarcEdit (RecordNumber | Tag | Indicators | Content), but in many cases the result is inconsistent for obvious reasons (for example, when tag 100 is absent, the content of tag 245 ends up in the column for tag 100).


forEach(value.parseHtml().select("tr"),e,e.htmlText()).join("@@")

Result


000 01320nam a2200325 i 4500@@001 vtls001911996@@003 NLI@@005 20210807000205.0@@007 cr cn|||||||||@@008 160914t20142014ctuab s 001 0 eng d@@020 _a9780300180312@@020 _a9780300206197 (e-book)@@039 9 _a201610171143 _bgopag _y201609141656 _zgopag@@044 _actu@@100 1 _aLaband, John, _d1947- _91270208@@245 1 0 _aZulu warriors : _bthe battle for the South African frontier / _cJohn Laband.@@260 _aNew Haven, Connecticut : _bYale University Press, _c2014.@@300 _a1 online resource (390 p.) : _bill., maps@@500 _aIncludes index.@@650 0 _aZulu War, 1879. _91270209@@650 0 _aZulu (African people) _xHistory _y19th century. _91270210@@650 0 _aSociology, Military _zSouth Africa _zZululand. _91270211@@651 0 _aZululand (South Africa) _xHistory, Military _y19th century. _91270212@@856 4 0 _uhttp://site.ebrary.com/lib/nationallibgovin/Doc?id=10856661 _zAn electronic book accessible through the World Wide Web; click to view@@887 _a Gopa@@905 _aGopa@@949 _AVIRTUAITEM _D10000 _X206 _6EBK000017588ENG _eEBK17588@@942 _2ddc _cBKS@@999 _aVIRTUA _c2402943 _d2402943@@999 _aVTLSSORT0070*0080*0200*0201*0440*1000*2450*2600*3000*5000*6500*6501*6502*6510*8560*9050*9992

I’m still searching for a way out.

Best regards

Parthasarathi Mukhopadhyay
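
The inconsistency described above (a missing 100 pushing 245 into the wrong column) comes from treating fields by position. A possible way around it is to key every row on its own tag, taken from the `<th>` cell, and only then pick out 245, 5xx and 650. The following is a minimal sketch in standalone Python 3 using only the standard library; it assumes the "type 2" table layout shown in the post (tag in `<th>`, two indicator cells, subfield codes in `<strong>`). The names `MarcTableParser`, `parse_marc_table` and `corpus_row` are invented for this example, and dropping subfield `_9` (which looks like an internal link number) is an assumption.

```python
from html.parser import HTMLParser


class MarcTableParser(HTMLParser):
    """Collect one dict per <tr>: tag, indicators, [(code, value), ...]."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.fields = []
        self._row = None
        self._cell = None
        self._in_strong = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = {"tag": "", "cells": [], "subfields": []}
        elif self._row is not None and tag in ("th", "td"):
            self._cell = tag
            if tag == "td":
                self._row["cells"].append("")
        elif tag == "strong" and self._cell == "td":
            self._in_strong = True

    def handle_endtag(self, tag):
        if tag == "strong":
            self._in_strong = False
        elif tag in ("th", "td"):
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self._finish_row()

    def handle_data(self, data):
        if self._row is None or self._cell is None:
            return
        if self._cell == "th":
            self._row["tag"] += data
        elif self._in_strong:
            # "_a" -> start a new subfield with code "a"
            self._row["subfields"].append([data.strip().lstrip("_"), ""])
        elif self._row["subfields"]:
            self._row["subfields"][-1][1] += data
        else:
            self._row["cells"][-1] += data

    def _finish_row(self):
        row, self._row = self._row, None
        tag = row["tag"].strip()
        if not tag:
            return
        subs = [(c, v.strip()) for c, v in row["subfields"]]
        if subs:
            ind = "".join(c.strip() or " " for c in row["cells"][:2])
        else:
            # Control field: the value sits in the last cell.
            ind = ""
            subs = [(None, row["cells"][-1].strip() if row["cells"] else "")]
        self.fields.append({"tag": tag, "ind": ind, "subfields": subs})


def parse_marc_table(html):
    parser = MarcTableParser()
    parser.feed(html)
    parser.close()
    return parser.fields


def corpus_row(fields, skip=("9",)):
    """Return (title + notes, subjects); `skip` lists codes to drop."""
    def text_of(f):
        return " ".join(v for c, v in f["subfields"] if v and c not in skip)
    text = [text_of(f) for f in fields
            if f["tag"] == "245" or f["tag"].startswith("5")]
    subjects = [text_of(f) for f in fields if f["tag"] == "650"]
    return " ".join(text), " | ".join(subjects)


# Rows copied from the example above, deliberately without a 100 field.
sample = """<table>
<tr><th style="white-space:nowrap">245</th><td>1</td><td>0</td><td>
<strong>_a</strong>Zulu warriors :<br><strong>_b</strong>the battle for the South African frontier /<br><strong>_c</strong>John Laband. </td></tr>
<tr><th style="white-space:nowrap">500</th><td> </td><td> </td><td>
<strong>_a</strong>Includes index. </td></tr>
<tr><th style="white-space:nowrap">650</th><td> </td><td>0</td><td>
<strong>_a</strong>Zulu War, 1879.<br><strong>_9</strong>1270209 </td></tr>
</table>"""

text, subjects = corpus_row(parse_marc_table(sample))
print(text)
print(subjects)
```

Because each field carries its own tag, a record with no 100 simply yields no 100 entry, and 245 stays where it belongs.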

Hi @psm. Most Koha installations have the OpenAPI endpoint available. You can make a RESTful call to the Koha OpenAPI endpoint, even if the OAI interface is disabled.

https://librarycatalogue.nvli.in/api/v1/

Example:
https://librarycatalogue.nvli.in/api/v1/items/2402943

I won’t go into the details of how to search and use an OpenAPI endpoint, but with a bit of Jython or GREL you should be able to explore and parse the data more easily, since the output will be JSON.

Koha REST API (koha-community.org)
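
To give an idea of what parsing the JSON might look like, here is a minimal sketch in standalone Python 3. It rests on assumptions not confirmed in this thread: that this installation exposes Koha's public biblio endpoint (`/public/biblios/{id}`), that it allows anonymous access, and that it honours `Accept: application/marc-in-json`. The helper names `fetch_marc_json` and `subfield_values` are invented for the example, and the sample record is hand-built from values shown earlier in the thread rather than a captured API response.

```python
import json
import urllib.request

BASE = "https://librarycatalogue.nvli.in/api/v1"


def fetch_marc_json(biblio_id, base=BASE):
    """Fetch one record as MARC-in-JSON.

    ASSUMPTION: the public biblio endpoint is enabled on this server
    and accepts this Accept header; check the server's API docs first.
    """
    req = urllib.request.Request(
        f"{base}/public/biblios/{biblio_id}",
        headers={"Accept": "application/marc-in-json"},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)


def subfield_values(record, tag, code=None):
    """Values for a tag in a MARC-in-JSON record, optionally one subfield."""
    out = []
    for field in record.get("fields", []):
        for t, content in field.items():
            if t != tag:
                continue
            if isinstance(content, str):
                # Control field: the value is a plain string.
                out.append(content)
                continue
            for sub in content.get("subfields", []):
                for c, v in sub.items():
                    if code is None or c == code:
                        out.append(v)
    return out


# Hand-built sample in MARC-in-JSON shape (not a real API response).
sample = {
    "leader": "01320nam a2200325 i 4500",
    "fields": [
        {"001": "vtls001911996"},
        {"245": {"ind1": "1", "ind2": "0", "subfields": [
            {"a": "Zulu warriors :"},
            {"b": "the battle for the South African frontier /"},
        ]}},
        {"650": {"ind1": " ", "ind2": "0", "subfields": [
            {"a": "Zulu War, 1879."},
        ]}},
    ],
}

print(subfield_values(sample, "245", "a"))  # ['Zulu warriors :']
print(subfield_values(sample, "650"))       # ['Zulu War, 1879.']
```

If the endpoint works as assumed, `subfield_values(fetch_marc_json(2402943), "245")` would be the live equivalent; it is left uncalled here so the example runs without network access.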