Search for multiple strings using Text Filter ReGex

Hello,

I have searched the forum and online but haven't found an answer yet. Can I use the RegEx to search for more than one string. For example, I want to search and return for 'Glenralloch' and 'Glenrallock' in the Field Coll Place Description column. I know they both exist but want to select all records which have either one or the other of these.

This is a simple example of two similar words (k and h at end only difference) I can return by using the text filter with 'Glenrall' as below. I want to be able to search and return entries for strings that are very different e.g. Mount Binser and Glenralloch using a RegEx (with box checked) to return all records that have either Mount Binser or Glenralloch (and AND function search essentially).

I have tried value.match(\Glenralloch|Glenrralock) but it did not work in the Text Filter.

Any guidance or help on this is greatly appreciated.

Thanks,
Johno

Text Filter input box is pure Regex syntax, and not GREL.

Thad Guidry

Ok. I am not a native coder, sorry, I didn't understand the difference... How would I search for and return records from the Field Collection Place Description for the following strings 'Mount Binser' and 'One Mile Creek'.

I would just use a Text Facet and then click include next to each phrase in the Text Facet.
Our docs are searchable from our website and here’s the specific page on Exploring Data

We have links to other external resources like tutorial videos etc. Just use the site search and type “video tutorial”.

Also. OpenRefine was made for folks that do not know how to code to clean or analyze data. Made for you! So welcome to OpenRefine! Feel free to create new topics to discuss as you continue to learn.

Thanks for your response, Thad. I probably didn't frame my initial query very well (apologies).

I have been doing it the way you describe, but it is a manual process. For context, I work in a museum with natural history collection data. In the data I am working with, there are about 20,000 unique facets of the column I am working with. Often, we have a range of place names for specimens that are from the same place.

image

In the 20,000 unique strings there could upto 50 different iterations of the same place, example below for Grasmere Stream.

Cass, Grasmere Stream pond 1
Grasmere Stream, near Sarah Bridge 1
Grasmere Stream, tributary 1.2 kilometres north, Cass 1
Lower Bush and Middle Bush streams, Cass5
Middle Bush Stream, Lower, Cass4
Middle Bush Stream, south bridge, Cass4
Reservoir Bush and Middle Bush streams, Cass4
Reservoir and Middle Bush streams, Cass3
Middle Bush stream, north, Cass2
Middle Bush Stream, South branch, Cass2
Lower bush and Middle bush streams, Cass 1
Middle Bush Stream N, Cass 1
Middle Bush stream north, Cass 1
Middle Bush stream, Cass 1
Middle Bush stream, north branch, Cass 1
Cass Bush Stream 4
Cass Stream
(this is a subset of all facets that are relevant to Grasmere Stream)

I am doing text filter searches for all the different iterations and click include on all the entries on the text facet, but it is not super-efficient manually clicking the include button for all searches (e.g. 'Middle Bush Stream', 'Grasmere Steam', 'Lower Bush Stream', 'Cass Stream', Cass Bush Stream').

Once I have all the relevant facets selected, I am adding geographic context data to all records in other columns. This may be the only way to do this, but I hope there may be a way to return different facet iterations using the text filter, or some other OpenRefine tool/mechanism.

If this is not possible, no worries at all. Just looking for ways to streamline the process where I can.

Many thanks

There are some separator patterns in those places like “,” and maybe “-“ etc.

It might be wise to make a copy of that column that splits them so that you can use Clustering and also Reconciliation.

Either the column menu or use GREL with the Add Column based on with our various String split functions (or partition functions if it makes things easier)

Once you have a column(s) of Place names, then you can Reconcile perhaps against Wikidata or another recon service. Separating various parts of your long strings is the first step. So you can easily reconcile like all the states and then use that State column as a supplied property to reconcile your next column of city or general region. Wikidata has robust identifiers for Geographical entities. Really good tutorials out there showing how to accomplish what I generally described above.

To be more efficient for that kind of data, you need to pluck out and create a new column based on contains(). For example, your wide values of Grasmere stream sub locations mostly either contain the string “grasmere” or “cass” right? So then you might make a custom text facet on only those such as value.toLowercase().contains(“grasmere”) where you can also combine and use our Boolean functions like or() to look if the values contain “cass” or “grasmere”.

Finding the real common patterns is key here and where various features of OpenRefine can help more than others, like Clustering.

To answer the original question:

How would I search for and return records from the Field Collection Place Description for the following strings 'Mount Binser' and 'One Mile Creek'.

Select "Text Filter" on your desired column, click the "regular expression" check box, and enter the following string in the text field:

Mount Binser|One Mile Creek

You can get fancier with something like

^(Reservoir|Middle) Bush stream

to match either "Reservoir Bush stream" or "Middle Bush stream", but only at the beginning of the string. Regular expressions are very powerful, but can be a little bit difficult to learn (and debug).

If you're trying to debug a non-trivial regular expression, there are a number of websites which allow you to test a regular expression against example data and get feedback on how things are being matched. I usually use them to fine tune my expressions rather than trying to develop them directly in OpenRefine.

Tom

As @tfmorris says, using the text filter with a regular expression like:

string 1|string 2|string 3

will do the job.

An alternative is to create a custom text facet as follows:

  1. Choose the menu option Facet -> Custom text facet from the drop down on the column you want to filter
  2. Write GREL expression something like:
value.find(/string 1|string 2|string 3/i).length()>0

This is a pretty similar approach to using the text filter to be honest - it's a little more fiddly because the formulation is a little more complex (some stuff that's automatically done for you in the text filter like the i after the second / which makes the match case-insensitive) but the nice thing is that once you've applied this once the expression will be stored in your GREL expression History and so it's very easy to re-use in a later session.

The other nice thing about using a custom text facet is you can use more complex GREL approaches (and combine them with OR or AND if necessary) where you need to. I use custom text facets a lot in my work and feel its worth mastering them as an approach to data filtering and exploration.

All that said - being able to filter for a list of values is a use case I've heard several times and it feels like we should try to support more directly than building a regular expression (or other method) to find cells that contain a value from the list. So I've created a new feature request for that Easily filter rows based on a list of values · Issue #6189 · OpenRefine/OpenRefine · GitHub - feel free to contribute thoughts on there if you have ideas on how this could work

Thank you all for the comments and suggestions. I will have a play around with creating a copy column on Field Collection Place Description and use the clustering algorithms to group like terms for bulk updates. I have only started using reconciliation in OpenRefine, for scientific names, but keen to start using WikiData, will check out some tutorials/resources on this.

The | separator in text filter has definitely helped! I haven't played around with custom text facet but see how this would be helpful, by creating a new column that returns TRUE and FALSE values to facet on. I have used and stored this in the expression history, which is such a helpful feature!

Appreciate how responsive the OR community.

Thanks so much again!

1 Like