Number of matches between the columns

Dear all

In OpenRefine 3.6.2, I have columns like this (obtained after some processing works) -

NN OB OP TF Gold
System theory; Control theory; Automatic control Automatic control; System theory; Control theory Control theory; System theory; Automatic control Control theory; Automatic control; System theory; Mechatronics; Vibration Automatic control; Multibody systems; Vibration; Mechanics, Applied; System theory; Control theory; Thermodynamics; Heat engineering; Heat–Transmission; Mass transfer; Robotics; Automation; Physics

Is it possible through a suitable GREL to know how many exact matches are there between two columns (say Gold vs TF)?

Best regards

Parthasarathi Mukhopadhyay

To check I have understood, in the example the number of matches between TF and Gold is 4 because “Automatic control”, “System theory”, “Vibration” and “Control theory” appear in both the TF and Gold columns?

Is that correct understanding?

Yes, you are right.

The comparison/matching case for Gold vs TF should report number of exact matches (here 4).

Similarly, Gold vs OP matching should report 3 (and 3 is also the figure for Gold vs OB and Gold vs NN here in this example).

Thanks and regards

Thanks for the clarification
In this case, to (for example) to find the number of matches between the TF and GOLD columns you can make a custom text facet the GREL:
filter(cells["TF"].value.split("; "),v,cells["Gold"].value.split("; ").inArray(v)).length()

And of course substitute relevant column names to do other comparisons
This is splitting the cells in the first column mentioned into an array, and then for each value in that array checking whether it appears in the array created by splitting the cell in the second column mentioned.

There are other approaches, but this was the first one that occurred to me based on the data you shared

Owen

1 Like

Thanks Owen.

It’s working as per our expectations. The inArray function is a new entity for us. Thanks for the nice explanation, as usual.

However, in some rows for the Gold column, one value is repeated more than once (as it is handcrafted, a few errors are present), and thereby it is producing the wrong number of matchings as it is finding more than one match but the same match (not unique). I’m sorry that I hadn’t noticed this weakness of the dataset during my earlier reporting. The table below is an example.

Is it possible to filter for unique values in the Gold column first, then match?

NN OB OP TF Gold Gold-NN Gold-OB Gold-OP Gold-TF
Agriculture; Sustainability; Environmental sciences Agriculture; Sustainability Agriculture; Sustainability Agronomy; Agriculture; Agricultural ecology; Food security; Sustainability Agriculture; Sustainability; Botany; Biotic communities; Soil science; Agriculture; Sustainability; Environmental sciences; Biotic communities; Soil science; Agriculture; Sustainability; Botany; Soil science; Agriculture; Sustainability; Environmental sciences; Soil science 10 8 8 8

Best regards

Yes that’s no problem. You can use .uniques() to remove any duplicates from the array. You could either do this on each column before you do the comparison:
value.split("; ").uniques(),join(“; “)

Or you could include it in the comparison GREL:

filter(cells["TF"].value.split("; ").uniques(),v,cells["Gold"].value.split("; ").uniques().inArray(v)).length()

Thanks again Owen.

It’s now perfect for our purpose.

Best regards