Detecting patterns in a text mass

Susanna_Anas · December 10, 2023, 12:46pm

I have a dataset that I was not able to scrape in a more structured form. The result is blocks of text in which I can identify patterns that point to the right data, but creating the right regex statement has been challenging.

In the middle of the text block, there could be a line:

"Année d’inclusion à l’inventaire : 2021"

Which regex statement should I use to detect this? I have tried:
value.match(/.*Année d'inclusion à l'inventaire(.+)$/)[0]
value.find(/.*Année d'inclusion à l'inventaire(.+)$/)[0]
value.match(/.*|\nAnnée d'inclusion à l'inventaire(.+)$/)[0]
and some other variations

I can manage this with single-line data, but not with multi-line blocks.

Thank you for your help

b2m · December 11, 2023, 7:37am

With match in OpenRefine, the regular expression has to match the complete content of the cell. With find, the regular expression only needs to match a substring of the content.
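To illustrate the difference between whole-content and substring matching, here is a minimal sketch in Python rather than GREL. It assumes that `re.fullmatch` and `re.search` behave roughly like match and find for this purpose; the sample text is made up, and Python's regex engine is not the one OpenRefine uses.

```python
import re

# Hypothetical multi-line cell content with the target line in the middle.
text = "Titre : Exemple\nAnnée d’inclusion à l’inventaire : 2021\nAutre ligne"
pattern = r"Année d.inclusion à l.inventaire : \d+"

# Whole-content matching: fails because other lines surround the target line.
whole = re.fullmatch(pattern, text)

# Substring matching: succeeds because the pattern occurs somewhere in the text.
part = re.search(pattern, text)

print(whole)
print(part.group(0))
```

The whole-content attempt finds nothing, while the substring search returns just the target line.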

As far as I understand, you are searching for a line with a variable part inside a multi-line text.
Usually you can use "modifiers" (flags) to tell the regular expression engine to interpret $ as the end of a line instead of the end of the whole text.
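As a rough sketch of what such a modifier does outside OpenRefine, the Python snippet below runs the same pattern with and without the multi-line flag. The sample text is an assumption, and `re.MULTILINE` is Python's name for the flag; other engines spell it differently.

```python
import re

# Hypothetical multi-line cell content.
text = "Titre : Exemple\nAnnée d’inclusion à l’inventaire : 2021\nAutre ligne"
pattern = r"Année d.inclusion à l.inventaire(.+)$"

# Without the flag, $ only matches at the end of the whole text,
# and . does not cross the line break, so nothing is found.
no_flag = re.search(pattern, text)

# With the flag, $ also matches at the end of each line.
with_flag = re.search(pattern, text, flags=re.MULTILINE)

print(no_flag)
print(with_flag.group(1))
```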

AFAIK this is not supported with find in OpenRefine, but we can either use a more detailed regular expression or the linebreak \n instead.

More detailed version:

value.find(/Année d.inclusion à l.inventaire : \d+/)[0]

Using linebreak:

value.find(/Année d.inclusion à l.inventaire[^\n]+/)[0]

Note: The OpenRefine forum encodes the apostrophe ' differently than (my) OpenRefine does, so I replaced each apostrophe in the regular expressions with ., which matches any single character.
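The apostrophe issue can be reproduced with a small Python sketch. It assumes the data contains the typographic apostrophe ’ as in the quoted sample line, while the pattern was typed with a straight '. The variable names are only for illustration.

```python
import re

# Sample line as quoted in the question, with typographic apostrophes.
line = "Année d’inclusion à l’inventaire : 2021"

# Pattern typed with straight apostrophes: does not match the line above.
literal = r"Année d'inclusion à l'inventaire : (\d+)"
# Pattern with . in place of each apostrophe: matches either variant.
wildcard = r"Année d.inclusion à l.inventaire : (\d+)"

literal_hit = re.search(literal, line)
wildcard_hit = re.search(wildcard, line)
year = wildcard_hit.group(1)

# The wildcard pattern also works if the data uses straight apostrophes.
straight_hit = re.search(wildcard, line.replace("’", "'"))

print(literal_hit)
print(year)
```

The capture group around \d+ is an addition here, to pull out only the year.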

Susanna_Anas · December 16, 2023, 9:25pm

Thank you for the response, I will try this!
