Parsing puzzler

My Parsing Puzzler for the day!

Given this HTML:

<div class="dato">Título uniforme</div>
<div class="valor">
[La Guardia (Pontevedra). Cartas náuticas. 1779. ]
<div class="dato">Map data</div>
<div class="valor">
Escala [ca. 1:6.600]. 1200 varas castellanas [= 15,2 cm] 

value.parseHtml().select("div[class='dato']:containsOwn(Título uniforme) ~ div[class='valor']")[0].htmlText().toString()
results in:
[La Guardia (Pontevedra). Cartas náuticas. 1779. ]

value.parseHtml().select("div[class='dato']:containsOwn(Map data) ~ div[class='valor']")[0].htmlText().toString()
results in error:
Error: java.lang.IndexOutOfBoundsException: Index: 0, Size: 0

Why does the first GREL expression return an array with a single value, but the second one does not?

Puzzled. Searching...


I get "Escala [ca. 1:6.600]. 1200 varas castellanas [= 15,2 cm]" when operating on a cell with the HTML representing those two divs.

Is there more context which has been omitted?
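
One way to narrow this down might be to count what each part of the selector matches before indexing into the result. This is only a sketch, assuming the cell holds the full HTML and reusing the selector strings from the post above. First, check whether the label itself is found:

```grel
value.parseHtml().select("div[class='dato']:containsOwn(Map data)").length()
```

Then, as a second step, guard the index so that an empty match returns null instead of throwing the IndexOutOfBoundsException:

```grel
with(value.parseHtml().select("div[class='dato']:containsOwn(Map data) ~ div[class='valor']"), els, if(els.length() > 0, els[0].htmlText(), null))
```

If the first expression returns 1 but the second still yields null, the label is matched and the problem is probably in the sibling part: `~` only matches later elements that share the same parent, so the result depends on how the parser nested the tags in that particular file.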


Yes, between those two "dato - valor" div groupings is this one:

<div class="dato">Title</div>
<h1 class="valor">
Plano y explicación de la costa comprehendª entre la Barra del N. del Rio Miño y la Pta. de la Laxa del Perro, con el Pto. de la Guardia sitdo. pr. Obson. en la Latd N. de41º  52'. Levantado pr. Orn. del Rey en Abril de 1779 [Material cartográfico] / Levantado por Plancheta por el 2º Piloto de la Rl. Armada Juan Patricio García

And the difference here is that the "valor" portion of this is in an h1, not a div. I think that's the Clue I needed! So something is wrong with my use of the first operator in this:

~ div[class='valor']

Reading up on that in the Jsoup Selector docs.

Thanks again for the Clue I needed!


@mcyzyk Hi Mark! It's always good to reply with the answer once you discover it. This helps others in our community who might search HTML parsing issues and need your answer to help them. Thanks for using OpenRefine!

"It's always good to reply with the answer once you discover it."

Will do. (I have not yet, though, found the answer. I just don't see why my expression returns the Título uniforme but not the Map data DIV value. I've played with other Jsoup operators; same problem.)

An anecdote: A couple of years ago I was stumped on something with another software application and so went and searched their Forums for an answer. "Oh, here's a guy who had exactly the same problem as what I now have!" I thought. "I've found gold." I read the long post, with code examples, and something about the style of writing was so familiar. At the bottom, it was signed by me. I had posted it years before, along with the answer to the issue I was having!

Ok, it's looking like there is nothing wrong with my GREL expression above. Rather, there is something idiosyncratic about the file I'm parsing (and I've found an alternative file with the same data that parses just fine). So I am going to close this issue.