Extract from Crazy Dates!

Ok, here is a puzzler:

I have "date" values in all kinds of crazy "formats":

[Germany] : [Unknown] [19th century]
[Gotha : Justus Perthes] [ca. 1850]
[Gotha : Justus Perthes] 1824
[London] : [Heather] 18082
[London] : Henry Hansard Printer 1850 ([London] : Standidge & Co. Litho. Old Jewry)
[London] 1671
[London] 1782
[Maastricht] : [Di Georgi] [1845]
[Madrid] : publisher not known [1622]
[Madrid] [1622]
[Ms] 1794
[Netherlands] : G.S. de Tempe [1868]
[Nürnberg] : [Homannianes Heredes] 1740
[Paris : Binet] [1839]
[Paris] : [Charles-Antoine Jombert] 1752
[Paris] 1775
[Plaats van uitgave niet vastgesteld] : [uitgever niet vastgesteld] [1752]
[Plaats van uitgave niet vastgesteld] : Topographisch Bureau [1825]
[Plaats van uitgave niet vastgesteld] [circa1840]
[Plaats van uitgave niet vastgesteld] 1752
1824

One thing they all have in common is that somewhere in them is a number YYYY.

How to extract?

Reading GREL docs on RegEx....

Progress so far:

This does most of what I want:
value.match(/(^|^.*?\D)(\d{4})($|0$)/)[1]

but leaves values like this:
16980

This, then, cleans that:
value.match(/(^|^.*?\D)(\d{4})($|0$)/)[1]
results in
1698

Now wondering how I can merge these two RegEx'es?....

Why not use facets like the Text Filter which also supports RegEx.
And then do the extraction of dates in various formats in 2 or 3+ passes?
Why does the extraction have to be in a single operation step in OpenRefine?
Where will the extracted dates go? Into a new column?

  1. Filter on dates (4 digit numbers) in a cells' value using value.find(/\d{4}/)
    • extract them into a column called DATE with value.find(/\d{4}/).join("|") because find() outputs an array of elements, so joining them with a pipe separator, just in case you have multiple 4 digit dates in a string?
  2. Filter on dates having th century using value.contains("th century") ?
    • extract or find them into the same column called DATE using value.find(/(\d{2})th century/).join("|")
  3. Filter on other date patterns discovered using Facets, maybe Roman Numerals?
    • extract them into the same column called DATE or a new column WEIRD-DATES

Because I was apparently questing for the One RegEx to Rule Them All?!

As I went to sleep last night this thought of your popped into my head:

I have two routines that ultimately result in what I want. Just run one, then run the other.

Not sure why I was so fixated on the notion of a Single Pass!

Your advice, MUCH appreciated,

Mark

1 Like