I want to match data in a CSV file with Wikidata items, that is, find the QID of a matching item if it exists. I use the reconciliation endpoint and would like to restrict the class of item to “literary work” but not “edition or version of a work”; or vice versa, to exactly “edition or version of a work”, excluding all potential matches of the other type. It seems to me, though, that OpenRefine, or maybe the endpoint, lumps the two together in its reconciliation attempt. I furthermore try to exclude false positives using the publication date.
Nevertheless I get enormous amounts of useless suggestions from OpenRefine, including items that were published a hundred years after the item I am matching. Likewise, I get items representing an edition when I only want literary works. While this behaviour may be desired in some use cases (for instance, I suspect, when the manual labour of clicking through potential matches is thought to be cheap), it is not in my case. Is there a way to control these features in more detail? Am I just using them incorrectly?
I also try to match author names to QIDs and exclude false positives using known publications of the author, including the publication year. But there are again many, many false positives that seem to completely disregard such additional information. Can one somehow set more options to control this feature in a way I do not realize? How exactly are these additional columns used, and how can one interact with the mechanisms?
In reconciliation you have three ways to steer the search results in a certain direction.
The column you are using for reconciliation has the main influence. Its value has to match the content of a field in the data you are searching. Ideally you match on the title (label) of the item, which earns the most points, but usually other fields such as synonyms are searched as well. For example, when you search for Wolfgang Mozart, you may also get his parents, wife and sister in the search results, because "Wolfgang Mozart" is a good match for the content of fields such as spouse.
A sometimes surprising influence is the type that you select for reconciliation. This (usually!) acts as a filter, and only elements that have this type (or a subtype!) are considered as search results. As data sources are never perfect, the item you are looking for might not have the type you are filtering on. So especially with Wikidata you may need quite a bit of domain knowledge about how the ontology behind the data in your domain is structured.
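If the type filter still lets the other class through, one option is to filter the returned candidates yourself. The following is only a minimal sketch: it assumes each candidate carries a "type" list of {"id", "name"} entries, as reconciliation responses commonly do, and the function name and the sample candidates (including the placeholder IDs Q1 to Q3) are made up for illustration. Q7725634 is the Wikidata item for "literary work" and Q3331189 the one for "version, edition or translation". Note that this checks only the types listed on the candidate, so an item typed with a subclass of the wanted class would be dropped.

```python
LITERARY_WORK = "Q7725634"  # Wikidata: literary work
EDITION = "Q3331189"        # Wikidata: version, edition or translation


def keep_only_type(candidates, wanted, unwanted):
    """Keep candidates that list `wanted` among their types and not `unwanted`."""
    kept = []
    for cand in candidates:
        # Candidates without a "type" field are treated as having no types.
        type_ids = {t["id"] for t in cand.get("type", [])}
        if wanted in type_ids and unwanted not in type_ids:
            kept.append(cand)
    return kept


# Hypothetical candidates, shaped like entries of a reconciliation result list.
sample = [
    {"id": "Q1", "name": "work", "score": 90,
     "type": [{"id": LITERARY_WORK, "name": "literary work"}]},
    {"id": "Q2", "name": "edition", "score": 95,
     "type": [{"id": EDITION, "name": "version, edition or translation"}]},
    {"id": "Q3", "name": "untyped", "score": 80},
]

works_only = keep_only_type(sample, LITERARY_WORK, EDITION)
editions_only = keep_only_type(sample, EDITION, LITERARY_WORK)
```

Swapping the two arguments gives the "vice versa" case from the question.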
Using additional columns helps you tweak how search results are ranked. This is not a filter. Adding columns can move more relevant results to the top and make less relevant results disappear because they are scored quite low. But you may still get search results that are far outside your search parameters (e.g. in terms of date) and yet receive high scores for other reasons (for example a perfect match of the search term in an "influenced by" field).
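To see how the three inputs end up in one request, here is a rough sketch of a single query in the shape used by the reconciliation API that OpenRefine speaks: the search term in "query", the type in "type", and each additional column as an entry in "properties". The helper name and the sample title are made up for illustration. P577 is the Wikidata property for publication date. How much weight the service gives to the property value is up to the service, so treat this as a way to inspect what is being sent, not as a guarantee about scoring.

```python
import json


def build_query(title, type_qid, year=None, limit=5):
    """Build one reconciliation query: search term, type, optional year."""
    query = {"query": title, "type": type_qid, "limit": limit}
    if year is not None:
        # Additional column: publication date (P577), sent as a plain year.
        query["properties"] = [{"pid": "P577", "v": str(year)}]
    return query


# One batch with a single query, keyed by an arbitrary identifier.
queries = {"q0": build_query("Faust", "Q7725634", year=1808)}

# The batch is JSON-encoded and sent as the form parameter "queries".
payload = {"queries": json.dumps(queries)}
print(payload["queries"])
```

Sending this payload by hand to the endpoint (outside OpenRefine) and looking at the raw scores can help to tell whether the surprising candidates come from the service or from the way the project is configured.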
When using dates as additional columns, please be aware that the results may be even more confusing. The reason is that your date and the date in Wikidata may differ in precision: yours may be more exact, or Wikidata's may be.
So when you try to match "1808" to "1808-01-05" by string matching, which compares characters and not dates, a value like "1408-18-08" might receive a higher score. I therefore usually use only the year part of a date, or an explicit range search (if the reconciliation server supports this).
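Reducing both sides to the year can be sketched like this. It is an illustration only: the function names are made up, and it assumes the first run of four digits in the value is the year, which holds for values such as "1808", "1808-01-05" or "+1808-01-05T00:00:00Z" but not for every date format.

```python
import re


def year_of(value):
    """Return the first four-digit run in `value` as an int, or None."""
    match = re.search(r"\d{4}", str(value))
    return int(match.group()) if match else None


def same_year(a, b, tolerance=0):
    """Compare two date-like values by year only, within `tolerance` years."""
    year_a, year_b = year_of(a), year_of(b)
    if year_a is None or year_b is None:
        return False
    return abs(year_a - year_b) <= tolerance


print(same_year("1808", "1808-01-05"))
print(same_year("1808", "1408-18-08"))
```

The same comparison can be used after reconciliation to discard candidates whose publication year is too far from yours, with tolerance set to whatever slack your data needs.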