Parsing HTML soup

How do I extract the value 52.25, preferably by title, from this piece of HTML? This is what I see in the browser's inspector:

<span class="edit-mode move-mode" data-key="lat" title="latitude">52.25</span>

This is what I get when I fetch the HTML from the link I got from reconciling. It's 1400+ lines of HTML.

<span class="edit-mode move-mode" data-key="lat" title="latitude">{{lat}}</span>

From the documentation and this and this, I tried a number of expressions in a select

value.parseHtml().select("span[title='latitude']")
value.parseHtml().select("span.edit-mode move-mode")

but they all return an empty array, which I don't understand. I'm wondering if the browser does some additional processing that OpenRefine does not do when it just fetches the URL.

Not sure if this is of any use, here's part of the div that I'm looking at.

<div class="delete-hide">
    {{#cc2}}<div>
        <small>alternate countries : {{cc2}}</small></div>{{/cc2}} {{#population}}<div>population : {{population}}</div>{{/population}}
    <form class="form-inline">
        <span class="edit-mode move-mode" data-key="lat" title="latitude">{{lat}}</span>,
        <span class="edit-mode move-mode" data-key="lng" title="longitude">{{lng}}</span>

(rest of the div omitted for clarity).

I'm a n00b at CSS selectors. I'd appreciate any help with this.

This looks right to me, and works for me on the sample HTML you posted.

However, it is important to note that this expression produces an array, which cannot be stored in an OpenRefine cell. So although the preview looks fine, leaving the expression like this would result in an empty cell.

There are various ways to approach getting the actual value from the array but this is one:

forEach(value.parseHtml().select("span[title='latitude']"),v,v.ownText()).join("|")

This takes each element in the array generated by the select (which could be one element or many), extracts its ownText() (to get the value you want without the rest of the HTML), and then joins the results (which, again, could be just one value) into a pipe-separated string.

Hope that makes sense

Thanks Owen, it does make sense. Unfortunately, it does not work for me.

This is a link that I got reconciling: https://sws.geonames.org/2759381/

I did Add column by fetching URLs, and then:

  • just value, which gives me a 1400+ line cell, after which value.parseHtml().select("span[title='latitude']") returns an empty array.
  • forEach(value.parseHtml().select("span[title='latitude']"),v,v.ownText()).join("|") returns empty cells.

I'm wondering if the returned page is too big and I first need to extract a smaller section?

I'm on OR version 3.8.0 [TRUNK]; no idea if that's relevant.

Edit: looking at it again, I see now that what I need is inside a <script ...> that isn't executed yet. Hmmm.

@RolfBly In this case I'd recommend retrieving the machine-readable data. It's easy to fetch the RDF (with the example URL you gave) by doing Add column by fetching URLs -> value+"about.rdf"

This will get you back the much more manageable data:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<rdf:RDF xmlns:cc="http://creativecommons.org/ns#" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:foaf="http://xmlns.com/foaf/0.1/" xmlns:gn="http://www.geonames.org/ontology#" xmlns:owl="http://www.w3.org/2002/07/owl#" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#" xmlns:wgs84_pos="http://www.w3.org/2003/01/geo/wgs84_pos#">
    <gn:Feature rdf:about="https://sws.geonames.org/2759381/">
        <rdfs:isDefinedBy rdf:resource="https://sws.geonames.org/2759381/about.rdf"/>
        <gn:name>Bathmen</gn:name>
        <gn:alternateName>Batmen</gn:alternateName>
        <gn:featureClass rdf:resource="https://www.geonames.org/ontology#P"/>
        <gn:featureCode rdf:resource="https://www.geonames.org/ontology#P.PPL"/>
        <gn:countryCode>NL</gn:countryCode>
        <gn:postalCode>7437</gn:postalCode>
        <wgs84_pos:lat>52.25</wgs84_pos:lat>
        <wgs84_pos:long>6.2875</wgs84_pos:long>
        <gn:parentFeature rdf:resource="https://sws.geonames.org/2756986/"/>
        <gn:parentCountry rdf:resource="https://sws.geonames.org/2750405/"/>
        <gn:parentADM1 rdf:resource="https://sws.geonames.org/2748838/"/>
        <gn:parentADM2 rdf:resource="https://sws.geonames.org/2756986/"/>
        <gn:nearbyFeatures rdf:resource="https://sws.geonames.org/2759381/nearby.rdf"/>
        <gn:locationMap rdf:resource="https://www.geonames.org/2759381/bathmen.html"/>
        <gn:wikipediaArticle rdf:resource="https://en.wikipedia.org/wiki/Bathmen"/>
        <rdfs:seeAlso rdf:resource="https://dbpedia.org/resource/Bathmen"/>
    </gn:Feature>
    <foaf:Document rdf:about="https://sws.geonames.org/2759381/about.rdf">
        <foaf:primaryTopic rdf:resource="https://sws.geonames.org/2759381/"/>
        <cc:license rdf:resource="https://creativecommons.org/licenses/by/4.0/"/>
        <cc:attributionURL rdf:resource="https://www.geonames.org"/>
        <cc:attributionName rdf:datatype="https://www.w3.org/2001/XMLSchema#string">GeoNames</cc:attributionName>
        <dcterms:created rdf:datatype="https://www.w3.org/2001/XMLSchema#date">2006-01-15</dcterms:created>
        <dcterms:modified rdf:datatype="https://www.w3.org/2001/XMLSchema#date">2017-10-17</dcterms:modified>
    </foaf:Document>
</rdf:RDF>

You can then use value.parseXml().select("wgs84_pos|lat") to get the element that contains the latitude, and then extract the value following the same pattern as I posted above.
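
If you want to double-check the extraction outside OpenRefine, here is a minimal Python 3 sketch using only the standard library's xml.etree.ElementTree. It is not GREL and does not run as an OpenRefine expression. SAMPLE_RDF is a trimmed copy of the response above (only a few elements kept), and the function name lat_long is made up for this illustration. Unlike the GREL selector, ElementTree needs the namespace URI for the wgs84_pos prefix spelled out.

```python
import xml.etree.ElementTree as ET

# Namespace URI copied from the about.rdf response shown above.
NAMESPACES = {"wgs84_pos": "http://www.w3.org/2003/01/geo/wgs84_pos#"}

# Trimmed copy of the response, kept short for illustration only.
SAMPLE_RDF = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:gn="http://www.geonames.org/ontology#" xmlns:wgs84_pos="http://www.w3.org/2003/01/geo/wgs84_pos#">
    <gn:Feature rdf:about="https://sws.geonames.org/2759381/">
        <gn:name>Bathmen</gn:name>
        <wgs84_pos:lat>52.25</wgs84_pos:lat>
        <wgs84_pos:long>6.2875</wgs84_pos:long>
    </gn:Feature>
</rdf:RDF>"""


def lat_long(rdf_text):
    """Return (latitude, longitude) as strings, or None if either is missing."""
    root = ET.fromstring(rdf_text)
    lat = root.find(".//wgs84_pos:lat", NAMESPACES)
    lng = root.find(".//wgs84_pos:long", NAMESPACES)
    if lat is None or lng is None:
        return None
    return lat.text, lng.text


if __name__ == "__main__":
    print(lat_long(SAMPLE_RDF))
```

The values come back as text, just as they do from the GREL expression, so convert them to numbers afterwards if you need to calculate with them.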

I'm wondering if the browser does some additional processing that OpenRefine does not do when it just fetches the URL.

Edit: looking at it again, I see now that what I need is inside a <script ...> that isn't executed yet. Hmmm.

Exactly. The HTML is rendered on the fly by JavaScript, and OpenRefine doesn't execute the JavaScript that it fetches.
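
One way to see the effect outside OpenRefine is a standalone Python 3 sketch using only the standard library's html.parser. This is not what OpenRefine runs internally (its parseHtml() is built on jsoup). It only illustrates the general behaviour that an HTML parser treats the body of a <script> element as raw text, so a span that lives inside a script template is never seen as an element. The script type "text/template", the class SpanTextCollector and the helper span_texts are made up for this illustration, and the real page's script tag may look different.

```python
from html.parser import HTMLParser


class SpanTextCollector(HTMLParser):
    """Collect the text of every <span> whose title attribute matches."""

    def __init__(self, title):
        super().__init__()
        self.title = title
        self.inside = False  # True while we are inside a matching span
        self.found = []      # one text entry per matching span

    def handle_starttag(self, tag, attrs):
        if tag == "span" and dict(attrs).get("title") == self.title:
            self.inside = True
            self.found.append("")

    def handle_data(self, data):
        if self.inside:
            self.found[-1] += data

    def handle_endtag(self, tag):
        if tag == "span":
            self.inside = False


def span_texts(html, title):
    """Return the text of all spans with the given title attribute."""
    parser = SpanTextCollector(title)
    parser.feed(html)
    parser.close()
    return parser.found


# What the browser inspector shows after the page's JavaScript has run.
RENDERED = '<span class="edit-mode move-mode" data-key="lat" title="latitude">52.25</span>'

# Hypothetical stand-in for the fetched page: the same span, but inside a script template.
FETCHED = (
    '<script type="text/template">'
    '<span class="edit-mode move-mode" data-key="lat" title="latitude">{{lat}}</span>'
    '</script>'
)

if __name__ == "__main__":
    print(span_texts(RENDERED, "latitude"))  # the span is found
    print(span_texts(FETCHED, "latitude"))   # nothing is found inside <script>
```

The second call returns an empty list for the same reason the select returned an empty array: to the parser, everything inside the script element is just text.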

GeoNames has XML and JSON APIs as well as RDF downloads. Why not use one of those?

If you fetch https://www.geonames.org/2759381/about.rdf you can get the latitude with the GREL expression:

value.parseXml().select("wgs84_pos|lat")[0].xmlText()

Tom

Works like a charm! Thanks guys.