Configuring auto-match with OpenRefine Wikibase reconciliation

In OpenRefine, when reconciling, I can select "Auto-match candidates with high confidence". This does not really work for me, as the confidence value varies between the different data sets that I'm matching.

What I need is to be able to manually set the value for when it should auto-match, each time I reconcile. If I could set this value, I'd be able to run the reconciliation process a couple of times to adjust the value to one that works to auto-match most entities.

One of the most common cases I'm dealing with at the moment is company names where one of the data sets includes "Inc" or "Ltd" in the company name and the other does not. I need to be able to match all of them automatically, because there are a lot of them and clicking "Search for match" on each one in OpenRefine is not feasible.
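For the "Inc"/"Ltd" case, one possible workaround (not something built into OpenRefine's reconciliation, just a preprocessing idea) is to strip legal-form suffixes from the names before reconciling. A minimal Python sketch, where the suffix list and the function name `normalize_company_name` are my own assumptions and would need adapting to the data:

```python
import re

# Hypothetical list of legal-form suffixes; extend it for your own data.
LEGAL_SUFFIXES = ("inc", "ltd", "llc", "corp", "co", "plc", "gmbh")

# A suffix only counts if it is preceded by whitespace or a comma,
# so names like "Cisco" are left alone.
_SUFFIX_RE = re.compile(
    r"[\s,]+(?:" + "|".join(LEGAL_SUFFIXES) + r")\.?$",
    flags=re.IGNORECASE,
)


def normalize_company_name(name: str) -> str:
    """Strip trailing legal-form suffixes such as "Inc" or "Ltd."."""
    cleaned = name.strip()
    # Repeat so that stacked suffixes like "Foo Co., Ltd." are fully removed.
    while True:
        stripped = _SUFFIX_RE.sub("", cleaned)
        if stripped == cleaned:
            return cleaned
        cleaned = stripped
```

Applying this to both data sets before reconciling should make the name scores less sensitive to the suffix, at the cost of occasionally merging companies that differ only by legal form.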

The other issue I have is the scoring mechanism. I need to be able to adjust the weight for each scoring feature.

Something like A*name-matching + B*identifier-matching + C*date-matching + D*quantity-matching, where the weights A, B, C and D are user-specified. For C and D I should ideally be able to set the weight for individual properties, e.g. if a person had a "birth date" and a "married date", I could choose to weigh the birth date more, although this is not strictly necessary.
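To make the idea concrete, here is a minimal Python sketch of such user-weighted scoring combined with an auto-match threshold. It is not how OpenRefine or the Wikibase service currently scores candidates; the function names and feature keys are hypothetical. One deliberate deviation from the plain sum above: the result is divided by the total weight (a weighted average), so the threshold stays on the same scale as the individual feature scores.

```python
from typing import Dict


def weighted_score(features: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted average of feature scores.

    `features` maps a feature name (e.g. "name" or "birth date") to its score;
    `weights` maps the same names to user-chosen weights.
    Features without a weight are ignored.
    """
    total_weight = sum(weights.get(name, 0.0) for name in features)
    if total_weight == 0:
        return 0.0
    weighted_sum = sum(weights.get(name, 0.0) * value for name, value in features.items())
    return weighted_sum / total_weight


def should_auto_match(features: Dict[str, float], weights: Dict[str, float], threshold: float) -> bool:
    """Auto-match when the weighted score reaches the user-chosen threshold."""
    return weighted_score(features, weights) >= threshold
```

For example, with scores `{"name": 90, "identifier": 100, "date": 50}` and weights `{"name": 2, "identifier": 1, "date": 1}`, the weighted score is (180 + 100 + 50) / 4 = 82.5. Because the keys are arbitrary strings, per-property weights such as "birth date" versus "married date" fit the same structure.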

The most important thing is to set the threshold for when to auto-match entities. I'm not sure if that is possible to set somehow with OpenRefine, or if that would require it to be added to the "Reconcile column..." box in the UI? What I'd need is really just a text field next to the "Auto-match candidates with high confidence" checkbox, where the threshold value could be set, similar to the textbox below where you can set the "Maximum number of candidates to return".

Weighing my options here. I'm running the Wikibase reconciliation service locally on my computer, so I can modify it if needed. I'm wondering about maybe just calling that directly from a Jupyter notebook instead of using OpenRefine, if this is hard to do with OpenRefine itself for now.

And for the Wikibase reconciliation service, would adding such user-specified weighting require modifications to the code, or is it supported with some parameters?

"Conditions of a Match" are definitely something I would like OpenRefine to surface more to our users. Part of this involves getting Services to expose more, and conversely enhancing OpenRefine to allow users to inspect Features, their values, their weights, and make adjustments themselves interactively for potential candidates to match.

We have actively talked about such user weighting in the W3C Entity Reconciliation group.
In general, OpenRefine would ideally expose more of a Service's features and its own weighting, which would allow users to use Recon Facets to filter out or narrow down to the best candidates they need. Sometimes it's based on Property matches, and a user might manually adjust Property weights through Facet sliders that expose things like Type Strictness, or particular context values of a Service's Features that are returned from the Service and can be faceted upon.

Hi @matssk,

Thanks a lot for bringing up those usability issues, which I really sympathize with.

What I need is to be able to manually set the value for when it should auto-match, each time I reconcile. If I could set this value, I'd be able to run the reconciliation process a couple of times to adjust the value to one that works to auto-match most entities.

One way to achieve this is to disable auto-matching during reconciliation, then use the best candidate's score facet to select the interval in which you are confident of the matching quality and then use the "Match cells to their best candidates" action.

The other issue I have is the scoring mechanism. I need to be able to adjust the weight for each scoring feature.

This is a limitation of OpenRefine I am painfully aware of. As @thadguidry mentioned, this is a topic we have been working on in the group that coordinates changes to the protocol, but we haven't updated OpenRefine to expose individual reconciliation features yet.

The Wikibase reconciliation service does expose individual features, so you could exploit them from another client. For instance, say you are making a reconciliation query to find a person called "Joao Gilberto" who plays the guitar, on Wikidata. The query could look like this one and you can see the fitness of each candidate is graded separately with a score quantifying how well the label matches (all_labels) and how well the instrument playing property matches (P1303). This is something you could already exploit with another client. If you go down that route, then I hope the documentation of the protocol and the test bench will be helpful. I'd be keen to hear about your experience in the reconciliation group.
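As a rough illustration of what "another client" could look like from a notebook, here is a Python sketch using only the standard library. It assumes the service follows the reconciliation protocol's convention of a `queries` form parameter holding a JSON object of queries, and that each candidate carries a `features` list of `{"id": ..., "value": ...}` entries. The endpoint URL is assumed to be the public Wikidata one and should be replaced with your local service's URL; the helper names and the `Q5` (human) type are mine, not part of the service.

```python
import json
import urllib.parse
import urllib.request

# Assumed public Wikidata endpoint; substitute the URL of your local service.
ENDPOINT = "https://wikidata.reconci.link/en/api"


def build_queries(name, type_id=None, properties=None):
    """Build a single-query batch in the reconciliation protocol's JSON shape."""
    query = {"query": name}
    if type_id:
        query["type"] = type_id
    if properties:
        # `properties` maps a property id (e.g. "P1303") to the value to match.
        query["properties"] = [{"pid": pid, "v": v} for pid, v in properties.items()]
    return {"q0": query}


def reconcile(queries, endpoint=ENDPOINT):
    """POST the batch to the service and return the decoded JSON response."""
    data = urllib.parse.urlencode({"queries": json.dumps(queries)}).encode("utf-8")
    request = urllib.request.Request(endpoint, data=data)
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def feature_scores(candidate):
    """Turn a candidate's feature list into a {feature id: value} dict."""
    return {f["id"]: f["value"] for f in candidate.get("features", [])}
```

Usage would be along the lines of `response = reconcile(build_queries("Joao Gilberto", "Q5", {"P1303": "guitar"}))`, then calling `feature_scores(c)` for each `c` in `response["q0"]["result"]` to get the per-feature values (such as `all_labels` and `P1303`) that can be re-weighted client-side.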

In the future I would like to adapt OpenRefine so that it is able to store such features and expose them to users, so that they are able to inspect them and re-score the candidates with different weights as needed. This GitHub issue summarizes it: "Store and expose reconciliation candidate features" (OpenRefine/OpenRefine issue #3139).

I'm running the Wikibase reconciliation service locally on my computer, so I can modify it if needed.

You can certainly exploit that to tweak the scoring mechanism in the service directly. It's not exactly convenient but depending on your needs, it might be the easiest solution.

Thanks, this is really useful. I will explore disabling auto-matching during reconciliation and using the score facet, and I will try out using the Wikibase reconciliation service directly from a Jupyter notebook to explore the individual features as you described.

The only thing I think I'm missing now is using embeddings as a scoring mechanism for the description field, as described in a previous topic. That is, you could configure an embedding service to use, either local or in the cloud, then for the Description value calculate an embedding for both entities and use the cosine similarity between them as the scoring value. I guess it might turn out that comparing the embeddings of other descriptive string fields could also be useful, so it shouldn't be hard-coded to the description property. I'll look a bit more into how to do this. It looks like most of the relevant code is in wikidatavalue.py and engine.py.
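The scoring step itself could be sketched roughly as below. This is only an illustration of the cosine-similarity idea, not code from the reconciliation service: `embed` stands for whatever embedding service gets configured (a hypothetical callable returning a list of floats), and mapping the similarity to a 0-100 score is my assumption, chosen to line up with the other feature scores.

```python
import math
from typing import Callable, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def embedding_score(
    query_text: str,
    candidate_text: str,
    embed: Callable[[str], Sequence[float]],
) -> float:
    """Score two text fields by embedding similarity, on an assumed 0-100 scale.

    `embed` is a placeholder for the configured embedding service.
    Negative similarities are clamped to 0.
    """
    similarity = cosine_similarity(embed(query_text), embed(candidate_text))
    return 100.0 * max(0.0, similarity)
```

Since `embedding_score` takes arbitrary text, the same function would work for any descriptive string field, not just the description.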