Any good dataset for reconciliation?

One thing that has always bugged me is teaching reconciliation. I can’t find any good tutorial on reconciliation using OpenRefine, and I’m really not experienced with it myself; I’ve never had to do it for any project. I’m not sure what would be considered a great example of a dataset for reconciliation, so if anyone has something to share, I’d absolutely take it. I’m scheduled to teach OpenRefine again next semester and I’d love to add some good reconciliation content to my course.

What would be a good dataset for your course? One in which everything matched? Or one with some matches, some possible matches, and some non-matches? The second is more real-world, but the first can demonstrate the concept without issues. You could then quickly go on to demonstrate downloading additional columns of data to enrich the dataset. The real-world set could provide an exercise in selecting among the possible matches, confirming there is no match for the others, and creating new items in Wikidata, if that is what you choose to reconcile against.


I had to do a brief presentation of OpenRefine (45 minutes), which ended up being even shorter (20 minutes!). Thankfully, I was ready for that and used a very simple data set I created with four Wikidata sandbox items. I reconciled them against P31=Q21281405 (Wikidata internal entity), downloaded additional columns of data, and pushed some more properties to them.
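For anyone curious what that reconciliation step actually sends over the wire, it boils down to a small JSON query batch posted to a reconciliation endpoint. Here is a rough sketch in Python; the cell value is a placeholder, and the request shape follows the Reconciliation Service API that OpenRefine and the Wikidata service use:

```python
import json

# Sketch of a reconciliation query batch as OpenRefine would send it.
# "Wikidata Sandbox" is a placeholder cell value; Q21281405 is the
# "Wikidata internal entity" class mentioned above.
query_batch = {
    "q0": {
        "query": "Wikidata Sandbox",  # the cell value to reconcile
        "type": "Q21281405",          # restrict candidates to this type
        "limit": 5,                   # max candidates per cell
    }
}

# The batch is posted as a form field named "queries":
payload = {"queries": json.dumps(query_batch)}
print(payload["queries"])
```

The service replies with a ranked list of candidate entities per query, which is what OpenRefine shows as the match / possible-match choices.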

This is not the best way to show OpenRefine, but the audience, as they told me later, understood its potential and was eager for more, which is good.

I used this data set, which, in order to show some text transformation, has the names of the items flipped around a comma (,).
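To make that transformation concrete: in GREL it is something like `value.split(", ").reverse().join(" ")`, and the same idea as a small Python sketch (the function name and sample value are just illustrations) looks like this:

```python
def flip_name(cell):
    """Turn 'Surname, Forename' into 'Forename Surname'.

    Values without a comma are returned unchanged, mirroring how a
    transform would leave non-matching cells alone.
    """
    parts = [p.strip() for p in cell.split(",", 1)]
    if len(parts) == 2:
        return f"{parts[1]} {parts[0]}"
    return cell

print(flip_name("Sandbox, Wikidata"))  # -> Wikidata Sandbox
```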

The benefit of using the sandbox items is that you don’t have to worry about the changes you make to them.


The exercise that I use for reconciliation is to take a list of place names extracted from some library catalogue ‘place of publication’ data and reconcile them against Cities in Wikidata. Once the initial reconciliation is done, I get the participants to check they are happy with the matches and then use the ability to add a column from the reconciled values to pull in the country in which each place is situated as a new column.
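That "add column from reconciled values" step uses the data-extension side of the reconciliation API rather than the query side. As a sketch, assuming the usual Wikidata IDs (P17 for country; the example city IDs are illustrative), the request looks roughly like this:

```python
import json

# Sketch of a data-extension request: for each reconciled city,
# ask the service for its country (P17 on Wikidata).
extend_request = {
    "ids": ["Q84", "Q90"],          # reconciled item IDs, e.g. London, Paris
    "properties": [{"id": "P17"}],  # the column to pull: country
}

# Posted as a form field named "extend":
payload = {"extend": json.dumps(extend_request)}
print(payload["extend"])
```

The response maps each reconciled ID to its property values, which OpenRefine turns into the new country column.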

The way my exercises are structured, participants may end up with a different list of places to reconcile each time, but since the data is extracted from the place of publication in library catalogue records, I know it’s pretty much guaranteed to have some cities in it (the data I start from is mostly, but not exclusively, books published in the UK).

The advantage of using Wikidata is its breadth and the fact that it supports the full range of reconciliation functionality. I don’t currently teach updating Wikidata after reconciliation on a regular basis (I’d say that’s more a Wikidata-specific thing than a reconciliation thing), but I agree with Robert that if you want to teach that, then sandbox items are probably the way to go.

If you are interested, the materials I use to teach reconciliation are:


I’d say that for me, a good dataset is one that gives the most teaching-bang for your data-buck. So I usually like a dataset that is, say, 90% easy and 10% harder stuff to fix, but of a kind that represents very common issues. In forestry/biology, I’d always look for messed-up dates, typos in words, and homonyms (we use Latin words and some field assistants can be very creative).
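Those creative spellings of Latin names are exactly the kind of thing OpenRefine’s clustering catches before reconciliation. A sketch of the underlying idea using Python’s difflib (the species names here are made-up examples of the scenario, not real field data):

```python
from difflib import get_close_matches

# A reference list of accepted Latin names (examples only).
accepted = ["Pinus sylvestris", "Picea abies", "Betula pendula"]

# A field entry with a creative spelling, as described above.
raw = "Pinus sylvestrus"

# Find the closest accepted name; the cutoff keeps wild guesses out.
matches = get_close_matches(raw, accepted, n=1, cutoff=0.8)
print(matches)  # -> ['Pinus sylvestris']
```

Cleaning these near-duplicates first means the reconciliation service sees one consistent spelling per taxon instead of several variants.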

Since I’m not too sure what issues most people will encounter while trying to reconcile data, I can’t be very specific in my request. The only thing I tried was to reconcile the boroughs in my bed bug dataset, and it didn’t give me much to play with, so I figured there has to be a more exciting way to use reconciliation.


Okay, thank you! I’ll have a look into it. Thank you Robert as well for the sandbox idea. I’ve never heard of Wikidata sandbox items, so I’ll do some reading!
