Tips on performing cross() between values that aren’t perfectly equal?

archilecteur · May 22, 2025, 3:25pm

Hello OR Community,

I have a list of published conference titles, on one hand, and a list of preliminary conference titles (as submitted to the conference organizers prior to selection), together with their abstracts, on the other hand. I have to cross-reference the data so as to associate the abstracts submitted at the time of evaluation with the titles of the conferences that were actually included in the congress program. (The abstracts were not published in the final program).

However, titles are sometimes changed along the way, or more simply, slight differences are introduced to facilitate indexing when the program is published. Even if I have to accept losing the abstracts whose titles have been completely reworked, I’d like to catch the ones that are merely similar. Ideally, I’d need a somewhat more robust solution than fingerprint().

While I’m thinking about it, a function for antidictionary() would also be nice. (With help, I managed to build one in Google Sheets a few years ago).

Should a Samaritan revive Yatszhash's FuzzyMatch extension?

b2m · May 23, 2025, 7:40am

Here are some other ideas and/or strategies to handle this situation ordered from easy to more complex:

1. You could create a "key" column from the titles in each of your projects, e.g. by using fingerprint or other GREL functions, and then use the "key" columns for cross. But as you already stated, "fingerprint" might not be enough.
2. You could use csv-reconcile and rely on the fuzzier, more user-friendly reconciliation process in OpenRefine.
3. You could load both projects into one and use the new custom clustering methods to identify the matching titles.
4. You could write your own matching functions/algorithms and call them via Jython.

The strategy of course also depends on whether you are able to write and/or run Python scripts in your environment.
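To give an idea of what the last option could look like, here is a minimal sketch using difflib from the Python standard library. The function names normalize and best_match and the 0.8 cutoff are my own assumptions, not anything built into OpenRefine, and normalize is only a crude stand-in for a real fingerprint key.

```python
import difflib


def normalize(title):
    """Lowercase and collapse whitespace (a crude stand-in for a fingerprint key)."""
    return " ".join(title.lower().split())


def best_match(title, candidates, cutoff=0.8):
    """Return the candidate most similar to `title`, or None if nothing reaches `cutoff`.

    `cutoff` is an arbitrary starting point and would need tuning on real titles.
    """
    # Map each normalized candidate back to its original spelling.
    lookup = {}
    for candidate in candidates:
        lookup.setdefault(normalize(candidate), candidate)

    hits = difflib.get_close_matches(normalize(title), list(lookup), n=1, cutoff=cutoff)
    return lookup[hits[0]] if hits else None
```

Calling best_match(submitted_title, published_titles) for each submitted title returns either the closest published title or None, so the reworked titles simply drop out instead of being matched wrongly.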

thadguidry · May 23, 2025, 12:33pm

String similarity is the problem domain in this case. Almost all of those functions are available in Clustering. Outside of OpenRefine, there are many other tools that might perform better for the fuzzy lookup or fuzzy record linkage that you specifically need. To start, take a look at the Fuzzy Lookup Add-in for Excel: https://www.microsoft.com/en-us/download/details.aspx?id=15011
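For readers new to string similarity, here is a small sketch of one common measure, Levenshtein edit distance, scaled to a 0..1 score. This is a plain textbook implementation for illustration, not OpenRefine's own code, and the helper name similarity is an assumption of mine.

```python
def levenshtein(a, b):
    """Number of single-character insertions, deletions or substitutions turning a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def similarity(a, b):
    """Scale the distance to a score between 0.0 (nothing shared) and 1.0 (identical)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - float(levenshtein(a, b)) / longest
```

A score like this can be compared against a threshold to decide whether two titles are "close enough", which is roughly the kind of decision a fuzzy lookup tool makes for you.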

archilecteur · May 23, 2025, 1:57pm

Thanks Benjamin and Thad for the answers.

Ideally, I'd like to invoke the more advanced clustering algorithms when running cross(). Loading both datasets into the same project might be feasible in this case, working with a combination of clustering runs and some custom manipulations.

Nevertheless, I'll probably have to go the reconciliation route at some point, as I have a few more cases of similar tasks to complete this summer, and it's piling up.

Is there a more concise way to use stopwords in GREL than: word, (word != "le").and(word != "les").and(word != "la").and...)?

b2m · May 24, 2025, 4:01pm

How about using inArray for that?

["le", "les", "la"].inArray("needle")

archilecteur · May 26, 2025, 3:29pm

Hi @b2m,

I'm not sure how to integrate inArray to accomplish what I’m looking for: in a custom clustering, adding the exclusion of a list of words to a fingerprint-type transformation, as is done here, but more concisely, for a (much) longer list.

Since Friday I have read about this Jython solution, and I might take that route.

But I'm still curious about how inArray can be combined with custom clustering, if it’s doable.

In any case, thank you for the replies!

b2m · May 27, 2025, 6:14am

I guess it is easier to first apply fingerprint and then perform the filtering of your stopwords.

In GREL this looks like:

filter(
    fingerprint(value).split(" "),
    v,
    not(["le", "les", "la"].inArray(v))
).join(" ")

But when you have really big stopword lists, then using Jython to load them from a file on disk is also feasible.
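A rough sketch of that Jython route, assuming a plain text file with one stopword per line. The function names and the file layout are my own assumptions, and the path you pass in is whatever location your OpenRefine process can actually read.

```python
import io


def load_stopwords(path):
    """Read one stopword per line, skipping blank lines and lowercasing each entry."""
    with io.open(path, encoding="utf-8") as handle:
        return set(line.strip().lower() for line in handle if line.strip())


def drop_stopwords(key, stopwords):
    """Remove stopwords from a space-separated key such as the output of fingerprint()."""
    return " ".join(token for token in key.split(" ") if token not in stopwords)
```

In an OpenRefine Jython expression you would then end with something like return drop_stopwords(value, load_stopwords(path)), applied to a column that already holds the fingerprint keys. Note that this reads the file once per cell, which may be slow on large projects.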

archilecteur · May 27, 2025, 7:29pm

Actually an elegant solution there. For a short list of stopwords anyway.

Thank you Benjamin!
