Detecting patterns in a text mass

I have a dataset that I was not able to scrape into a more structured form. The result is blocks of text in which I can identify patterns that locate the right data, but writing the right regex has been challenging.

In the middle of the text block, there could be a line:

"Année d’inclusion à l’inventaire : 2021"

Which regex should I use to detect this? I have tried:
value.match(/.*Année d'inclusion à l'inventaire(.+)$/)[0]
value.find(/.*Année d'inclusion à l'inventaire(.+)$/)[0]
value.match(/.*|\nAnnée d'inclusion à l'inventaire(.+)$/)[0]
and some other variations

I can get this to work with single-line data.

Thank you for your help :)

With match in OpenRefine, the regular expression has to match the complete content, whereas with find it only needs to match a substring of the content.
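
To make the difference concrete outside OpenRefine, here is a minimal Python sketch. It is only an analogy: Python's re.fullmatch and re.search stand in for match and find, and the sample text is a made-up block built around the line from the question.

```python
import re

# Hypothetical multi-line cell value built around the line from the question.
text = "Nom : Exemple\nAnnée d’inclusion à l’inventaire : 2021\nAutre ligne"

pattern = r"Année d.inclusion à l.inventaire : \d+"

# Analogous to match: the pattern must cover the whole text, so this fails.
whole = re.fullmatch(pattern, text)

# Analogous to find: the pattern only has to match somewhere inside the text.
part = re.search(pattern, text)

print(whole)          # None
print(part.group(0))  # Année d’inclusion à l’inventaire : 2021
```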

As far as I understand, you are searching for a line with a variable part inside a multi-line text.
Usually you could use "modifiers" to tell the regular expression engine to interpret $ as the end of a line instead of the end of the whole text.

AFAIK this is not supported with find in OpenRefine, but we can either use a more detailed regular expression or the line break \n instead.

More detailed version:

value.find(/Année d.inclusion à l.inventaire : \d+/)[0]

Using linebreak:

value.find(/Année d.inclusion à l.inventaire[^\n]+/)[0]

Note: The OpenRefine forum encodes the apostrophe ' differently than (my) OpenRefine, so I replaced it with the wildcard . (which matches any single character) in the regular expressions.
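
If you want to check the patterns outside OpenRefine, the following Python sketch compares both expressions with the modifier approach mentioned earlier. It assumes Python's re module rather than OpenRefine's own engine, so treat it as an illustration of the regex logic only. The sample text is hypothetical.

```python
import re

# Hypothetical multi-line cell value; note the typographic apostrophes (’).
text = "Nom : Exemple\nAnnée d’inclusion à l’inventaire : 2021\nAutre ligne"

# "." in place of the apostrophe matches both ' and ’.
detailed = re.search(r"Année d.inclusion à l.inventaire : \d+", text)

# [^\n]+ consumes everything up to, but not including, the next line break.
to_line_end = re.search(r"Année d.inclusion à l.inventaire[^\n]+", text)

# With a multi-line modifier, $ matches at the end of each line.
with_flag = re.search(r"Année d.inclusion à l.inventaire(.+)$", text, re.MULTILINE)

# Without the modifier, $ only matches at the end of the whole text,
# and "." does not cross line breaks, so nothing is found.
without_flag = re.search(r"Année d.inclusion à l.inventaire(.+)$", text)

print(detailed.group(0))     # Année d’inclusion à l’inventaire : 2021
print(to_line_end.group(0))  # Année d’inclusion à l’inventaire : 2021
print(with_flag.group(1))    #  : 2021
print(without_flag)          # None
```

The last case mirrors why the original attempts ending in (.+)$ return nothing on multi-line data.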

Thank you for the response, I will try this!