Got you covered… visit the Recipes on our Wiki, where I added a section long ago covering just this use case. As you'll read there, you have a few options, using either GREL or Jython.
I've just added a couple more recipes to the page that @thadguidry mentions, as I wasn't sure it actually covered the scenario you needed.
If I've understood your use case correctly, I think you'll want the following:
Finding cells that contain a repeated consecutive word
A way to identify cells that contain the same word repeated consecutively (i.e. the same word appears two or more times in a row within the cell value) is to create a Custom Text Facet using the following GREL expression:
This will give an outcome of true if there are consecutively repeated words anywhere in the string, and false otherwise.
The expression works by first replacing any non-word characters with spaces, using the regular expression class for "non-word characters". It then splits the resulting string into pairs of words (word ngrams of length 2). Each ngram is split into a list of its constituent words and that list is deduplicated; an ngram that no longer contains two words after deduplication is a repeated word. If the number of such ngrams is more than zero, there is at least one instance of a repeated consecutive word in the string.
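For anyone who finds the steps easier to follow as code, here is a rough Python sketch of the same logic. It is not the GREL expression from the recipe, and the function name is made up for illustration. It assumes the regular expression treats only ASCII letters, digits and underscore as word characters (hence re.ASCII), and it does not try to reproduce how OpenRefine's own ngram() behaves on very short strings.

```python
import re


def has_repeated_consecutive_word(value: str) -> bool:
    """Hypothetical helper: True if any word is immediately repeated."""
    # Step 1: replace every non-word character with a space.
    # re.ASCII is an assumption: it limits \w to [A-Za-z0-9_].
    cleaned = re.sub(r"\W", " ", value, flags=re.ASCII)

    # Step 2: split into words and build word ngrams of length 2.
    words = cleaned.split()
    bigrams = list(zip(words, words[1:]))

    # Step 3: deduplicate the words in each ngram; an ngram left with
    # fewer than two distinct words is a repeated consecutive word.
    repeats = [pair for pair in bigrams if len(set(pair)) < 2]

    # Step 4: more than zero such ngrams means at least one repeat.
    return len(repeats) > 0


print(has_repeated_consecutive_word("the the cat sat"))  # True
print(has_repeated_consecutive_word("the cat sat"))      # False
```

The comparison is case-sensitive, so "The the" would not count as a repeat in this sketch.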
Ah - looks like the use of \W here is not ideal, as it also removes any accented characters. This leads (for example) to the situation where value.replace(/\W/," ") on a phrase like vis-à-vis gives vis vis, which then looks like a duplicated word.
Here the regular expression [^\p{L}\p{N}] looks for characters that aren't in either the Unicode Letter category \p{L} or the Unicode Number category \p{N}. For a full list of the regular expression Unicode categories, see Regex Tutorial - Unicode Characters and Properties.
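To illustrate the idea outside OpenRefine, here is a small Python sketch that keeps only characters in the Unicode Letter and Number categories. Python's built-in re module does not support \p{L} or \p{N}, so this sketch looks up each character's category with unicodedata instead. Both function names are hypothetical and this is only an approximation of what the GREL expression does.

```python
import unicodedata


def strip_non_letters_numbers(value: str) -> str:
    """Replace every character outside Unicode categories L* and N* with a space."""
    return "".join(
        # category() returns e.g. "Ll", "Lu", "Nd"; the first letter is the major class.
        ch if unicodedata.category(ch)[0] in ("L", "N") else " "
        for ch in value
    )


def has_repeated_consecutive_word_unicode(value: str) -> bool:
    """Hypothetical helper: True if two adjacent words are identical."""
    words = strip_non_letters_numbers(value).split()
    # Compare each word with the one that follows it.
    return any(first == second for first, second in zip(words, words[1:]))


print(strip_non_letters_numbers("vis-à-vis"))              # vis à vis
print(has_repeated_consecutive_word_unicode("vis-à-vis"))  # False
print(has_repeated_consecutive_word_unicode("the the cat"))  # True
```

Because à is kept as a letter, vis-à-vis becomes three separate words and is no longer reported as a repeat.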
Let me know if this doesn't do the job and I'll take another look!
Thank you, Owen. This tweak works very well. Also, I appreciate the link to the Regex Unicode tutorial, which is super helpful in tracking down random OCR errors.
One problem remains: this formula still returns (some? all?) two-word cells.