Reconciliation restricted to classes and subclasses

I want to match data in a CSV file with Wikidata items, that is, find the QID of a matching item if it exists. I use the reconciliation endpoint and would like to restrict the class of item to “literary work” but not “edition or version of a work”; or, vice versa, to exactly “edition or version of a work”, excluding all potential matches of the other type. It seems to me, though, that OpenRefine, or maybe the endpoint, lumps the two together in its reconciliation attempt. I also try to exclude false positives using the publication date.

Nevertheless, I get an enormous number of useless suggestions from OpenRefine, including items published a hundred years after the item I am matching. Likewise, I get items representing an edition when I only want literary works. While this behaviour may be desired in some use cases (for instance, I suspect, when the manual labour of clicking through potential matches is thought to be cheap), it is not in my case. Is there a way to control these features in more detail? Am I just using them incorrectly?

I also try to match author names to QIDs and exclude false positives using known publications of the author, including the publication year. But again there are many, many false positives that seem to completely disregard this additional information. Are there more options for controlling this feature that I am not aware of? How exactly are these additional columns used, and how can one interact with the mechanisms?

From your other posts I suspect you are a German speaker, so I can recommend the following tutorial: Workshop - Erweiterter Abgleich mit Wikidata | FDMLab@LABW

In reconciliation you have three ways to steer the search results in a certain direction.

  1. The column you are using for reconciliation has the main influence. Its content has to match the content of a field in the data you are searching. Ideally you match on the title of the item (most points), but usually other fields such as synonyms are searched as well. For example, when you search for Wolfgang Mozart, you may also get his parents, wife and sister in the search results, because "Wolfgang Mozart" is a good match for the content of their spouse field, and so on.

  2. A sometimes surprising influence is the type you select for reconciliation. This (usually!) acts as a filter, and only elements that have this type (or a subtype!) are considered as search results. As data sources are never perfect, your data might not be recorded under the type you are filtering on. So especially with Wikidata you may need quite a bit of domain knowledge about how the ontology behind the data in your domain is structured.

  3. Using additional columns helps you tweak how search results are ranked. This is not a filter. Adding columns might help you move more relevant results to the top and make less relevant results disappear because they are scored quite low. But you may still get search results that are way outside your search parameters (e.g. in terms of date) and still receive high scores for other reasons (e.g. a perfect match of the search term in an "influenced by" field).
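To make the three points above concrete, here is a minimal sketch of what a batch of queries looks like in the Reconciliation Service API format that OpenRefine sends to a service. It only builds the payload and makes no network call. The helper name `build_queries` and the sample rows are made up for illustration; Q7725634 ("literary work") and P577 ("publication date") are the Wikidata identifiers I would assume for the use case in this thread.

```python
import json

LITERARY_WORK = "Q7725634"   # Wikidata class "literary work"
PUBLICATION_DATE = "P577"    # Wikidata property "publication date"


def build_queries(rows, type_qid=LITERARY_WORK, limit=5):
    """Build a reconciliation query batch from (title, year) rows.

    Hypothetical helper: "query" is the main column (point 1),
    "type" is the class restriction (point 2) and "properties"
    carries the additional columns used for scoring (point 3).
    """
    queries = {}
    for i, (title, year) in enumerate(rows):
        query = {"query": title, "type": type_qid, "limit": limit}
        if year:
            # Additional columns are sent as property/value pairs.
            query["properties"] = [{"pid": PUBLICATION_DATE, "v": str(year)}]
        queries[f"q{i}"] = query
    return queries


# Made-up sample rows; the second one has no known year.
rows = [("Faust", 1808), ("Woyzeck", None)]

# The batch is serialized to JSON and sent as the "queries" form parameter.
payload = {"queries": json.dumps(build_queries(rows))}
print(payload["queries"])
```

How strictly the type is enforced and how much weight the properties get is decided by the service that receives this payload, so the same batch can behave differently on different endpoints.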

When using dates as additional columns, please be aware that the results may be more confusing. The reason is that your date may be more precise than the one in Wikidata, or the other way round.

So when you try to match "1808" to "1808-01-05" by string matching, the date "1408-18-08" might receive a higher score. I therefore usually only use the year part of a date, or an explicit range search (if the reconciliation server supports this).
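A small sketch of the "year part only" approach, in case you want to prepare the column outside OpenRefine. It assumes ISO-style values (`YYYY`, `YYYY-MM` or `YYYY-MM-DD`); the function names are hypothetical.

```python
import re

# Accepts "1808", "1808-01" or "1808-01-05" (ISO-style only, by assumption).
ISO_DATE = re.compile(r"\s*(\d{4})(?:-\d{2}(?:-\d{2})?)?\s*")


def year_only(value):
    """Return the four-digit year of an ISO-style date, or None."""
    match = ISO_DATE.fullmatch(str(value))
    return match.group(1) if match else None


def same_year(left, right):
    """Compare two dates of different precision by year only."""
    left_year, right_year = year_only(left), year_only(right)
    return left_year is not None and left_year == right_year


print(year_only("1808-01-05"))          # 1808
print(same_year("1808", "1808-01-05"))  # True
print(same_year("1808", "1408-08-18"))  # False
```

Values in any other format come back as `None` rather than being guessed, so they can be reviewed by hand.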

Hi y’all,

I would also love to have better control over the reconciliation (or at least better understand why there are sometimes totally wrong results when better results are available).

Very often, I have to double-check the reconciliation for things that I knew were obviously not what I wanted. Typically I add a column “give me the P31 of the reconciled value” and filter out values that are not what I asked for. It feels a bit “two steps forward, one step back”…
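Outside OpenRefine, the same double-check could be sketched like this. It assumes you already have the entity JSON of each candidate in the shape returned by the Wikidata `wbgetentities` API; the function names and the sample data (including the placeholder IDs `Q_WORK`, `Q_EDITION`, `Q1001` and so on) are invented for illustration.

```python
def instance_of(entity):
    """Collect the P31 ("instance of") item IDs from one entity JSON."""
    ids = []
    for statement in entity.get("claims", {}).get("P31", []):
        datavalue = statement.get("mainsnak", {}).get("datavalue")
        if datavalue:  # "no value" / "unknown value" snaks have no datavalue
            ids.append(datavalue["value"]["id"])
    return ids


def keep_candidates(entities, wanted, unwanted=frozenset()):
    """Keep the QIDs that have a wanted class and no unwanted class."""
    kept = []
    for qid, entity in entities.items():
        classes = set(instance_of(entity))
        if classes & set(wanted) and not classes & set(unwanted):
            kept.append(qid)
    return kept


def p31(*ids):
    """Build a minimal fake entity with the given P31 values (test data)."""
    return {"claims": {"P31": [
        {"mainsnak": {"datavalue": {"value": {"id": i}}}} for i in ids
    ]}}


# Placeholder IDs, not real Wikidata items.
candidates = {
    "Q1001": p31("Q_WORK"),
    "Q1002": p31("Q_EDITION"),
    "Q1003": p31("Q_WORK", "Q_EDITION"),  # modeled as both work and edition
}
print(keep_candidates(candidates, {"Q_WORK"}, {"Q_EDITION"}))  # ['Q1001']
```

Note that this compares P31 values exactly and does not follow subclasses, so it is stricter than the type filter of a reconciliation service.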

And I too have had the case of looking for “edition or version of a work” but getting results that are not in that class, like “literary work” (and vice versa).

The same goes for the additional clues: I wish there were a way to say “yes, *really* trust this column”. Especially for identifiers (like ISBN), I often end up calling the API directly for better results.

This is something that is under the control of the reconciliation service, not OpenRefine. Reconciling literary works is complicated by the fact that they tend to be somewhat inconsistently modeled/recorded in Wikidata. You'll find entries which are both editions and works or works with ISBNs and publishers, etc.

A trick that sometimes helps for authors is to create a column containing the string "writer" and match it to the property "occupation" but, again, the actual range of values which are used for authors can be quite varied.

Books and authors are the main things that I typically reconcile to Wikidata, so I'm motivated to improve this, but I haven't had time to dig into it. The fact that the Wikidata reconciliation service is currently unsupported doesn't help matters.

Tom

Is there really nothing to do on the OpenRefine side? (at the very least a warning/explanation that the reconciliation will not always do what is expected)

And if not, how can I learn more about and fix the reconciliation service?

I know Wikidata data is often messy; that is actually what I'm often trying to solve, but sadly I often can't use OpenRefine to fix the mess (or only with a lot of steps, which is tedious). This is a bit annoying because I use OpenRefine to fix a lot of other things in Wikidata (I now have more than 6 million edits there). Even more frustrating is not understanding what is happening.

About ISBN (and all identifiers): we already have the function Use values as identifiers; could we have something like Use values as identifiers inside an item? (Currently I'm doing it with a call to the search API, which is a bit ugly and inefficient; it's a bit like the Property paths in https://wikidata.reconci.link/.)
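For reference, the search API workaround looks roughly like the sketch below. It only builds the request URL and does not send it. It assumes the `haswbstatement` search keyword on the Wikidata search API and P212 as the ISBN-13 property; the function name is hypothetical and the ISBN is just a placeholder value.

```python
from urllib.parse import urlencode

API = "https://www.wikidata.org/w/api.php"
ISBN_13 = "P212"  # Wikidata property "ISBN-13"


def isbn_search_url(isbn13):
    """Build a search URL for items that have this exact ISBN-13 value."""
    params = {
        "action": "query",
        "list": "search",
        # Matches the stored string exactly, so hyphenation matters.
        "srsearch": f"haswbstatement:{ISBN_13}={isbn13}",
        "format": "json",
    }
    return f"{API}?{urlencode(params)}"


# Placeholder ISBN for illustration only.
print(isbn_search_url("978-3-16-148410-0"))
```

Because the value is compared as an exact string, an ISBN stored with different hyphenation will not be found, which is part of why this feels ugly compared to a proper identifier lookup.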