Find repeated (doubled) words within cells

Will · May 30, 2023, 7:16pm

I have some faulty OCR data containing some doubled words. For example, a cell might contain

the quick quick fox ran over the lazy dog

Is there a regex that will allow me to filter for only records containing doubled words? The doubled word is different in each instance, obviously.

thadguidry · May 31, 2023, 1:37pm

Got you covered… visit our Recipes on our Wiki, where I put a section long ago regarding just this use case. You have a few options as you’ll read it with GREL or Jython.

OpenRefine Wiki → Recipes - Duplicate Patterns

ostephens · June 1, 2023, 1:20pm

I've just added a couple more recipes to that page that @thadguidry mentions as I wasn't sure it actually covered the scenario you needed.

If I've understood your use case correctly I think you'll want the following:

Finding cells that contain a repeated consecutive word

A way to identify cells that contain the same word repeated consecutively (i.e. the same word appears two or more times in a row within the cell value) is to create a Custom Text Facet using the following GREL expression:
filter(value.replace(/\W/," ").ngram(2),n,n.split(" ").uniques().length()!=2).length()>0
This will give an outcome of true if there are consecutively repeated words anywhere in the string, and false otherwise.

The expression works by first replaces any non word characters with spaces using the regular expression class of 'non word characters', and then splits the resulting string into pairs of words (word ngrams of length 2) and looks for any ngrams that do not contain two words if you split the ngram into a list of its constituent words and deduplicate the list. If the number of the ngrams that don't contain two words on de-duplication is more than zero there is at least one instance of a repeated consecutive word in the string.

Will · June 2, 2023, 4:25pm

Thanks very much Owen and Thad. This technique is working well, and I’ve learned a lot about custom expressions.

FWIW, this formula returns some false positives with two-word cells, and even some blank cells.

For example, it returns “true” for these strings

CODE CIVIL CHILIEN, promulgué le 14 décembre 1855, entré en vigueur le ler janvier 1857 (traduit, annoté et précédé d'une introduction par H. Prudhomme).
Lettre des Etats-Unis. La situation des Etats-Unis vis-à-vis de la Convention internationale.

Not sure if there’s some hidden text in there that I can’t find?

ostephens · June 3, 2023, 2:20pm

Ah - looks like the use of \W here is not ideal - as it also removes any accented characters - which leads (for example) to the situation where when we do value.replace(/\W/,"") on a phrase like vis-à-vis we get vis vis which then results in it looking like a duplicate word

Tweaking this a bit you could try:

filter(value.replace(/[^\p{L}\p{N}]/," ").ngram(2),n,n.split(" ").uniques().length()!=2).length()>0

Here the regular expression [^\p{L}\p{N}] is looking for characters that aren’t in either the unicode Letter category \p{L} or the Unicode Number category \p{N}. For a full list of the regular expression unicode categories see Regex Tutorial - Unicode Characters and Properties

Let me know if this doesn’t do the job and I’ll take another look!

Will · June 6, 2023, 11:37am

Thank you, Owen. This tweak works very well. Also, I appreciate the link to the Regex Unicode tutorial, which is super helpful in tracking down random OCR errors.

One problem remains: this formula still returns (some? all?) two-word cells.

ostephens · June 6, 2023, 11:40am

Can you provide an example?

Topic		Replies	Views
GREL for exact word matching between columns Data cleaning and transformations	1	282	November 8, 2023
Reordering text within a cell Data cleaning and transformations	2	203	April 1, 2024
Search for multiple strings using Text Filter ReGex Support and Helpdesk	10	672	November 26, 2023
Filtering/Keep words based on list in column Support and Helpdesk	5	234	December 4, 2023
Creating a conditional (if-then) regex Support and Helpdesk	4	549	June 12, 2023

Find repeated (doubled) words within cells

Related topics