Reconciliation to a subset of WD

Is there a way to tell OpenRefine to reconcile only against a subset of Wikidata, e.g. people from the Swedish Parliament?


  SELECT ?person WHERE {
    VALUES ?member {
      wd:Q33071890
      wd:Q81531912
      wd:Q82697153
      wd:Q10655178
    }
    ?person wdt:P39 ?member .
  }

I think this question is quite interesting, as I had been wondering how to do this myself but never took the time to find out. It relies on some "hidden" features of OpenRefine and on the scoring algorithm of the Wikidata reconciliation service. I added links to where these are described at the end.

This is how I got it working in OpenRefine 3.6:

  1. In OpenRefine, add a new column with the different names of the Swedish parliament you are searching for.

Note that you need the names and not the QIDs.
Note that you need to use records mode, so that the different names are effectively combined with a logical OR. If you needed a logical AND, you would use a separate column for each name.

  2. Reconcile against Wikidata, using the column with the different names of the Swedish parliament as an additional property, with P39 as the property type.

  3. Members of the listed parliaments should receive higher scores than other people in Wikidata.
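For anyone who wants to script the same idea outside the OpenRefine UI, here is a minimal sketch of what such a reconciliation query could look like. It assumes the service accepts the standard reconciliation API query shape (`query`, `type`, `properties` with `pid` and `v`) and that a list of values for one property behaves like the OR described above. The function names, the person name, and the position labels are placeholders I made up for illustration, not values from this thread.

```python
import json


def build_recon_query(name, position_names, limit=5):
    """Build one reconciliation query that passes position names as P39 values."""
    return {
        "query": name,
        "type": "Q5",  # restrict candidates to humans
        "limit": limit,
        # Assumption: several values for the same property act as alternatives.
        "properties": [{"pid": "P39", "v": list(position_names)}],
    }


def build_batch(names, position_names):
    """Key each query as q0, q1, ... as reconciliation batches usually do."""
    return {
        f"q{i}": build_recon_query(name, position_names)
        for i, name in enumerate(names)
    }


# Placeholder inputs; replace with your own names and position labels.
batch = build_batch(
    ["Anders Andersson"],
    ["position label A", "position label B"],
)
payload = json.dumps(batch)
```

The resulting `payload` string is what would be sent as the `queries` parameter to a reconciliation endpoint; the scoring itself still happens on the service side.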

Here are some sources with background information:


Thanks, I will test it.

I feel that as WD gets bigger, this ”subset” functionality is becoming more important. The interesting thing about members of the Swedish Parliament is that in the early 1900s we got more members from the Social Democrats, and they mostly had “common” names based on patronymics, like “the son of Anders” = Andersson. As a result, the “standard” OpenRefine reconciliation is a mess, and the new Wikimedia Commons add-on, where you can’t preview uploaded pictures, doesn’t help either.

See the feature request for a preview: File preview before uploading to Wikicommons · Issue #5594 · OpenRefine/OpenRefine · GitHub

If you want to reconcile against a small enough subset of Wikidata, you could consider writing a SPARQL query defining the set of such entities, download the results and put it in csv-reconcile or reconcile-csv, which both let you run a reconciliation endpoint off a CSV table. That would ensure you get matches from this subset only.
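A rough sketch of the first half of that workflow (SPARQL query to CSV) could look like this. It assumes the public Wikidata Query Service endpoint and its JSON result format. The CSV column names `id` and `name`, the User-Agent string, the label languages, and the sample binding are my own placeholders, so adjust them to whatever your csv-reconcile or reconcile-csv setup expects.

```python
import csv
import io
import json
import urllib.parse
import urllib.request

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"


def build_query(position_qids):
    """SPARQL selecting everyone who holds (P39) one of the given positions."""
    values = " ".join(f"wd:{qid}" for qid in position_qids)
    return (
        "SELECT DISTINCT ?person ?personLabel WHERE {\n"
        f"  VALUES ?member {{ {values} }}\n"
        "  ?person wdt:P39 ?member .\n"
        '  SERVICE wikibase:label { bd:serviceParam wikibase:language "sv,en" . }\n'
        "}"
    )


def fetch(query):
    """Run the query against WDQS and return the parsed JSON (needs network)."""
    url = WDQS_ENDPOINT + "?" + urllib.parse.urlencode(
        {"query": query, "format": "json"}
    )
    req = urllib.request.Request(
        url, headers={"User-Agent": "subset-recon-sketch/0.1"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def bindings_to_csv(results):
    """Turn SPARQL JSON results into a two-column CSV: QID and label."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["id", "name"])
    for binding in results["results"]["bindings"]:
        qid = binding["person"]["value"].rsplit("/", 1)[-1]
        writer.writerow([qid, binding["personLabel"]["value"]])
    return out.getvalue()


# Offline demonstration with a made-up binding (not real Wikidata content).
sample = {
    "results": {
        "bindings": [
            {
                "person": {"value": "http://www.wikidata.org/entity/Q00000"},
                "personLabel": {"value": "Example Person"},
            }
        ]
    }
}
sample_csv = bindings_to_csv(sample)
```

With network access, `bindings_to_csv(fetch(build_query([...])))` with the QIDs from the original question would give the table to load into the CSV-based reconciliation service.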

The default Wikidata recon service is pretty useless when it comes to reconciling people, since the type system cannot be used to filter by occupation. Perhaps the “occupation” property could be treated just like “instance of” by the reconciliation system, so that one could filter more accurately. It would be a fairly easy change to make.

There is also a dedicated recon service for people, run by Ontotext, that you can add with the URL “https://reconcile.ontotext.com/people”.


I'm not sure that I agree with this assessment. I successfully reconcile people with Wikidata all the time using the default reconciliation service. I tend to address the above problem by adding columns for occupations / countries of citizenship, reconciling these first, and then adding these as additional properties to be taken into account while reconciling. Works fairly OK for me most of the time, although of course it's not perfect.

A main hurdle to working more efficiently with the WD recon service is IMO really UI related: one needs to use the mouse pretty awkwardly, hovering and clicking a lot, to properly disambiguate people (or entities in general) with similar names and to identify and select the right one.

I've tried the Ontotext recon service for people and I had bad experiences with it:

  • In the cases where I tried it, it reconciled to very wrong names with 100% confidence (large percentage of absolutely wrong matches, pretty unacceptable IMO)
  • It only returns three suggested results when matching is uncertain
  • Search doesn't function properly
  • It's really only people, and usually datasets are dirtier (there will very often be some stray collectives and organizations in the mix)

Off topic: the lobid-gnd API for OpenRefine allows you to use parts of the query string syntax from Elasticsearch.

When searching for people with several columns of information (birthdate, occupation, …), this is really awesome. You can also declare date ranges or perform fuzzy matching. So far this has given me the best user experience when performing reconciliation tasks.

I wrote a (German) tutorial on how to use the lobid-gnd API with OpenRefine last year.


Interesting discussion altogether, and interesting workarounds.

I would like to point out one additional limitation of the reconciliation service, which is related to these. The reconciliation type only considers P31, not P279. Therefore, it is impossible to find an obvious match for a subtype of something.

Another comment: I agree that the reconciliation UI cannot cope with the breadth of data on Wikidata and needs a serious overhaul. It would be interesting to know what the chances are of creating custom interfaces. Would they need to live in a completely separate environment and recreate all the other existing functionality, or would there be ways to create UI extensions to OR?

Susanna_Anas wrote (December 22):

  I would like to point out one additional limitation of the reconciliation service, which is related to these. The reconciliation type only considers P31, not P279. Therefore, it is impossible to find an obvious match for a subtype of something.

The Wikidata reconciliation service seems to have code to do this. Does it not work in practice? Do you have an example which could be used to investigate?

Tom

Maybe we should make a separate thread of this. It is possible I don't know how to make use of the code, but in general, when a type is chosen, instances of that type are presented, not subclasses. I will create an example soon.
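While preparing such an example, one way to check independently of the reconciliation service whether an item matches a type directly or only via subclasses is a pair of SPARQL ASK queries. The sketch below only builds the query strings. It assumes the usual `wd:`/`wdt:` prefixes of the Wikidata Query Service, and the function name and the example QIDs (Q42 as the item, Q5 as the type) are mine, chosen purely for illustration.

```python
def type_match_query(item_qid, type_qid, include_subclasses=True):
    """Build an ASK query testing whether an item is an instance of a type.

    With include_subclasses=True the property path also follows
    "subclass of" (P279) links from the item's direct class up to the type.
    """
    path = "wdt:P31/wdt:P279*" if include_subclasses else "wdt:P31"
    return f"ASK {{ wd:{item_qid} {path} wd:{type_qid} . }}"


# Direct instance check versus check through the subclass hierarchy.
direct = type_match_query("Q42", "Q5", include_subclasses=False)
via_subclasses = type_match_query("Q42", "Q5")
```

If the direct query is false and the subclass query is true for an item, but the item is not offered when reconciling against that type, that would make a concrete test case for the new thread.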