Number of matches between the columns

psm · November 15, 2022, 12:09pm

Dear all

In OpenRefine 3.6.2, I have columns like this (obtained after some processing):

NN	OB	OP	TF	Gold
System theory; Control theory; Automatic control	Automatic control; System theory; Control theory	Control theory; System theory; Automatic control	Control theory; Automatic control; System theory; Mechatronics; Vibration	Automatic control; Multibody systems; Vibration; Mechanics, Applied; System theory; Control theory; Thermodynamics; Heat engineering; Heat–Transmission; Mass transfer; Robotics; Automation; Physics

Is it possible, with a suitable GREL expression, to find out how many exact matches there are between two columns (say Gold vs TF)?

Best regards

Parthasarathi Mukhopadhyay

ostephens · November 16, 2022, 11:40am

To check I have understood, in the example the number of matches between TF and Gold is 4 because “Automatic control”, “System theory”, “Vibration” and “Control theory” appear in both the TF and Gold columns?

Is that a correct understanding?

psm · November 16, 2022, 7:29pm

Yes, you are right.

The comparison/matching case for Gold vs TF should report number of exact matches (here 4).

Similarly, Gold vs OP matching should report 3 (and 3 is also the figure for Gold vs OB and Gold vs NN here in this example).

Thanks and regards

ostephens · November 17, 2022, 11:47am

Thanks for the clarification.
In this case, to find (for example) the number of matches between the TF and Gold columns, you can create a custom text facet with the GREL:
filter(cells["TF"].value.split("; "),v,cells["Gold"].value.split("; ").inArray(v)).length()

And of course substitute the relevant column names to do other comparisons.
This splits the cell in the first column mentioned into an array and then, for each value in that array, checks whether it appears in the array created by splitting the cell in the second column mentioned.
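
For anyone who wants to sanity-check the counts outside OpenRefine, the same split-and-test logic can be sketched in Python. This is only an illustration and not part of the GREL answer: the function name count_matches is made up here, and the two strings are the TF and Gold cells from the example row earlier in the thread.

```python
def count_matches(first, second, sep="; "):
    """Count the values from `first` that also appear in `second`."""
    first_values = first.split(sep)    # like cells["TF"].value.split("; ")
    second_values = second.split(sep)  # like cells["Gold"].value.split("; ")
    # Keep each value of the first list that is present in the second list,
    # then count what is left (the role of filter(...).length() in GREL).
    return len([v for v in first_values if v in second_values])


tf = "Control theory; Automatic control; System theory; Mechatronics; Vibration"
gold = ("Automatic control; Multibody systems; Vibration; Mechanics, Applied; "
        "System theory; Control theory; Thermodynamics; Heat engineering; "
        "Heat–Transmission; Mass transfer; Robotics; Automation; Physics")

print(count_matches(tf, gold))  # 4
```

Note that the separator is "; " (semicolon plus space), so a value such as "Mechanics, Applied" stays intact even though it contains a comma.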

There are other approaches, but this was the first one that occurred to me based on the data you shared

Owen

psm · November 17, 2022, 8:46pm

Thanks Owen.

It’s working as per our expectations. The inArray function is a new entity for us. Thanks for the nice explanation, as usual.

However, in some rows of the Gold column a value is repeated more than once (the column is handcrafted, so a few errors are present). This produces the wrong number of matches, because the same value is counted every time it occurs instead of only once. I’m sorry that I hadn’t noticed this weakness of the dataset in my earlier post. The table below is an example.

Is it possible to filter for unique values in the Gold column first, then match?

NN	OB	OP	TF	Gold	Gold-NN	Gold-OB	Gold-OP	Gold-TF
Agriculture; Sustainability; Environmental sciences	Agriculture; Sustainability	Agriculture; Sustainability	Agronomy; Agriculture; Agricultural ecology; Food security; Sustainability	Agriculture; Sustainability; Botany; Biotic communities; Soil science; Agriculture; Sustainability; Environmental sciences; Biotic communities; Soil science; Agriculture; Sustainability; Botany; Soil science; Agriculture; Sustainability; Environmental sciences; Soil science	10	8	8	8

Best regards

ostephens · November 17, 2022, 11:10pm

Yes that’s no problem. You can use .uniques() to remove any duplicates from the array. You could either do this on each column before you do the comparison:
value.split("; ").uniques().join("; ")

Or you could include it in the comparison GREL:

filter(cells["TF"].value.split("; ").uniques(),v,cells["Gold"].value.split("; ").uniques().inArray(v)).length()
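
As a rough cross-check outside OpenRefine, the effect of removing duplicates can be sketched in Python using the Gold and TF cells from the example row above. The names split_unique and count_matches are hypothetical helpers, not OpenRefine functions, and the sketch assumes the Gold column was the one being iterated over, which is what the 8 in the Gold-TF column suggests.

```python
def split_unique(cell, sep="; "):
    """Split a cell and drop repeated values, keeping first-seen order."""
    return list(dict.fromkeys(cell.split(sep)))


def count_matches(first, second, sep="; ", dedupe=True):
    """Count the values from `first` that also appear in `second`."""
    # With dedupe=False the cells are split as-is, so repeats are counted.
    split = split_unique if dedupe else (lambda cell, s: cell.split(s))
    second_values = split(second, sep)
    return len([v for v in split(first, sep) if v in second_values])


gold = ("Agriculture; Sustainability; Botany; Biotic communities; Soil science; "
        "Agriculture; Sustainability; Environmental sciences; Biotic communities; "
        "Soil science; Agriculture; Sustainability; Botany; Soil science; "
        "Agriculture; Sustainability; Environmental sciences; Soil science")
tf = "Agronomy; Agriculture; Agricultural ecology; Food security; Sustainability"

print(count_matches(gold, tf, dedupe=False))  # 8, repeats in Gold are counted
print(count_matches(gold, tf))                # 2, each value counted once
```

Only "Agriculture" and "Sustainability" occur in both cells, so 2 is the count once each value is considered a single time.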

psm · November 19, 2022, 5:53am

Thanks again Owen.

It’s now perfect for our purpose.

Best regards
