Find repeated (doubled) words within cells

I have some faulty OCR data containing some doubled words. For example, a cell might contain

the quick quick fox ran over the lazy dog

Is there a regex that will allow me to filter for only records containing doubled words? The doubled word is different in each instance, obviously.

Got you covered… visit our Recipes on our Wiki, where I put a section long ago regarding just this use case. You have a few options as you’ll read it with GREL or Jython.

OpenRefine Wiki → Recipes - Duplicate Patterns

1 Like

I’ve just added a couple more recipes to that page that @thadguidry mentions as I wasn’t sure it actually covered the scenario you needed.

If I’ve understood your use case correctly I think you’ll want the following:

Finding cells that contain a repeated consecutive word

A way to identify cells that contain the same word repeated consecutively (i.e. the same word appears two or more times in a row within the cell value) is to create a Custom Text Facet using the following GREL expression:

filter(value.replace(/\W/," ").ngram(2),n,n.split(" ").uniques().length()!=2).length()>0

This will give an outcome of true if there are consecutively repeated words anywhere in the string, and false otherwise.

The expression works by first replaces any non word characters with spaces using the regular expression class of ‘non word characters’, and then splits the resulting string into pairs of words (word ngrams of length 2) and looks for any ngrams that do not contain two words if you split the ngram into a list of its constituent words and deduplicate the list. If the number of the ngrams that don’t contain two words on de-duplication is more than zero there is at least one instance of a repeated consecutive word in the string.

2 Likes

Thanks very much Owen and Thad. This technique is working well, and I’ve learned a lot about custom expressions.

FWIW, this formula returns some false positives with two-word cells, and even some blank cells.

For example, it returns “true” for these strings

  • CODE CIVIL CHILIEN, promulgué le 14 décembre 1855, entré en vigueur le ler janvier 1857 (traduit, annoté et précédé d'une introduction par H. Prudhomme).
  • Lettre des Etats-Unis. La situation des Etats-Unis vis-à-vis de la Convention internationale.

Not sure if there’s some hidden text in there that I can’t find?

Ah - looks like the use of \W here is not ideal - as it also removes any accented characters - which leads (for example) to the situation where when we do value.replace(/\W/,"") on a phrase like vis-à-vis we get vis vis which then results in it looking like a duplicate word

Tweaking this a bit you could try:

filter(value.replace(/[^\p{L}\p{N}]/," ").ngram(2),n,n.split(" ").uniques().length()!=2).length()>0

Here the regular expression [^\p{L}\p{N}] is looking for characters that aren’t in either the unicode Letter category \p{L} or the Unicode Number category \p{N}. For a full list of the regular expression unicode categories see Regex Tutorial - Unicode Characters and Properties

Let me know if this doesn’t do the job and I’ll take another look!

Thank you, Owen. This tweak works very well. Also, I appreciate the link to the Regex Unicode tutorial, which is super helpful in tracking down random OCR errors.

One problem remains: this formula still returns (some? all?) two-word cells.

Can you provide an example?