Tips on performing cross() between values that aren’t perfectly equal?

Hello OR Community,

I have a list of published conference titles, on one hand, and a list of preliminary conference titles (as submitted to the conference organizers prior to selection), together with their abstracts, on the other hand. I have to cross-reference the data so as to associate the abstracts submitted at the time of evaluation with the titles of the conferences that were actually included in the congress program. (The abstracts were not published in the final program).

However, titles are sometimes changed along the way, or, more simply, slight differences are introduced to facilitate indexing when the program is published. Even if I have to accept losing the abstracts whose titles have been completely reworked, I’d like to catch the ones that are similar. Ideally, I’d need a somewhat more robust solution than fingerprint().

While I’m thinking about it, a function for antidictionary() would also be nice. (With help, I managed to build one in Google Sheets a few years ago).

Should a Samaritan revive Yatszhash's FuzzyMatch extension?

Here are some other ideas and/or strategies to handle this situation ordered from easy to more complex:

  1. You could create a "key" column from the titles in each of your projects, e.g. by using fingerprint or other GREL functions, and then use the "key" columns for cross. But as you already stated, "fingerprint" might not be enough.
  2. You could use csv-reconcile and rely on the fuzzier, more user-friendly reconciliation process in OpenRefine.
  3. You could load both projects into one and use the new custom clustering methods to identify the matching titles.
  4. You could write your own matching functions/algorithms and call them via Jython.

The strategy of course also depends on whether you are able to write and/or run Python scripts in your environment.
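As a rough illustration of strategy 4, here is a minimal sketch of a custom matching function, assuming plain Python with only the standard library (difflib). The helper names normalize and best_match and the 0.8 threshold are made up for this example and are not part of OpenRefine; the normalization is only a loose stand-in for fingerprint(), since it does not sort or deduplicate tokens.

```python
import difflib
import re


def normalize(title):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    title = re.sub(r"[^\w\s]", " ", title.lower(), flags=re.UNICODE)
    return " ".join(title.split())


def best_match(title, candidates, threshold=0.8):
    """Return (candidate, score) for the most similar candidate title.

    The candidate is None when no score reaches the (arbitrary) threshold.
    """
    key = normalize(title)
    best, best_score = None, 0.0
    for candidate in candidates:
        # ratio() is a similarity between 0.0 and 1.0
        score = difflib.SequenceMatcher(None, key, normalize(candidate)).ratio()
        if score > best_score:
            best, best_score = candidate, score
    if best_score < threshold:
        return None, best_score
    return best, best_score
```

The threshold would need tuning on real titles: too low and unrelated titles get paired, too high and reworded titles are missed.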

String similarity is the problem domain in this case. Almost all of those functions are available in Clustering. Outside of OpenRefine, there are many other tools that might perform better for the fuzzy lookup or fuzzy record linkage that you specifically need. To start, take a look at the Fuzzy Lookup Add-in for Excel: https://www.microsoft.com/en-us/download/details.aspx?id=15011
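To make "string similarity" a little more concrete, here is a small sketch of one common measure, Levenshtein (edit) distance, written in plain Python. It is only an illustration of the idea, not the code OpenRefine uses internally, and the similarity helper that scales the distance to a 0 to 1 range is an assumption made for this example.

```python
def levenshtein(a, b):
    """Edit distance: minimum number of single-character edits turning a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a, b):
    """Scale the distance to 0.0 (nothing in common) .. 1.0 (identical)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - float(levenshtein(a, b)) / longest
```

For titles that differ mostly in word order, a character-level distance like this can score poorly, which is one reason to normalize or sort tokens first.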

Thanks Benjamin and Thad for the answers.

Ideally, I'd like to invoke the more advanced clustering algorithms when running cross(). Loading both datasets into the same project might be feasible in this case, and could work with a combination of clustering runs and some custom manipulations.

Nevertheless, I'll probably have to go the reconciliation route at some point, as I have a few more cases of similar tasks to complete this summer, and it's piling up.

Is there a more concise way to use stopwords in GREL than: word, (word != "le").and(word != "les").and(word != "la").and...)?

How about using inArray for that?

["le", "les", "la"].inArray("needle")

Hi @b2m,

I'm not sure how to integrate inArray to accomplish what I’m looking for (in a custom clustering, adding the exclusion of a list of words to a fingerprint-type transformation). As is done here, but more concisely, in the case of a (much) longer list.

Since Friday I have read about this Jython solution, and I might take that route.

But still curious about the articulation of inArray with custom clustering, if it’s doable.

In any case, thank you for the replies!

I guess it is easier to first apply fingerprint and then perform the filtering of your stopwords.

In GREL this looks like:

filter(
    fingerprint(value).split(" "),
    v,
    not(["le", "les", "la"].inArray(v))
).join(" ")

But when you have really big stopword lists, then using Jython to load them from a file on disk is also feasible.
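A minimal sketch of that Jython route, assuming a UTF-8 text file with one stopword per line. The function names and the file path are hypothetical, and the code sticks to features available in both Python 3 and Jython 2.7.

```python
import io


def load_stopwords(path):
    """Read one stopword per line into a lowercase set, skipping blank lines."""
    with io.open(path, encoding="utf-8") as handle:
        return set(line.strip().lower() for line in handle if line.strip())


def remove_stopwords(text, stopwords):
    """Drop every whitespace-separated token found in the stopword set."""
    return " ".join(tok for tok in text.split() if tok.lower() not in stopwords)


# In an OpenRefine Jython expression, the cell content is available as `value`
# and the expression ends with a return statement, for example (path is made up):
#   return remove_stopwords(value, load_stopwords("/path/to/stopwords.txt"))
```

Written this way, the file would be read again for every cell, so for large projects it is probably worth keeping the list small or finding a way to load it once.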

Actually an elegant solution there. For a short list of stopwords anyway.

Thank you Benjamin!
