Any good dataset for reconciliation?

jfaurel · November 28, 2022, 8:15pm

One thing that has always bugged me was teaching reconciliation. I can’t find any good tutorial on reconciliation using OpenRefine and I’m really not experienced, never had to do it for any project. I’m not too sure what would be considered to be a great example of a dataset for reconciliation, so if anyone has something to share, I’d absolutely take it. I’m scheduled to teach OpenRefine again next semester and I’d love to add some good reconciliation content to my course.

dbigwood · November 28, 2022, 9:52pm

What would be a good dataset for your course? One in which everything matched? Or one that had some matches, some possible matches and some that don’t match? The 2nd is more real-world. But the 1st can demonstrate the concept without issues. You could then quickly go on to demonstrate downloading additional columns of data to enrich the dataset. The real-world set could provide an exercise in selecting among the possible matches, ensuring there is no match on the others and creating new items in Wikidata if that is your choice to reconcile against.
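To make the three cases concrete, here is a minimal Python sketch that sorts each cell's reconciliation candidates into "matched", "possible matches", or "no match". It assumes each candidate is a dict with a boolean `match` flag, which is how reconciliation services typically report a confident match. The ids, names, and scores below are made up for illustration and are not real Wikidata items.

```python
def classify(candidates):
    """Return the teaching bucket for one cell's list of candidates."""
    if not candidates:
        return "no match"
    # Assumption: the service sets match=True on a confident candidate.
    if any(c.get("match") for c in candidates):
        return "matched"
    return "possible matches"


# Hypothetical results for three cells (ids and scores are invented).
sample = {
    "Clean Name": [{"id": "X1", "name": "Clean Name", "score": 100, "match": True}],
    "Ambiguous Name": [
        {"id": "X2", "name": "Ambiguous Name (a)", "score": 71, "match": False},
        {"id": "X3", "name": "Ambiguous Name (b)", "score": 68, "match": False},
    ],
    "Unknown Name": [],
}

buckets = {cell: classify(cands) for cell, cands in sample.items()}
for cell, bucket in buckets.items():
    print(f"{cell}: {bucket}")
```

A dataset where every cell lands in the first bucket suits a concept demo; a mix of all three gives students something to judge.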

Robertgarrigos · November 29, 2022, 8:33am

I had to do a brief presentation of OpenRefine (45 minutes) which ended up being even shorter (20 minutes!). Thankfully, I was ready for that and used a very simple data set I created with four Wikidata sandbox items. I reconciled them against P31=Q21281405 (Wikidata internal entity), downloaded additional columns of data and pushed some more properties to them.

This is not the best way to show OpenRefine, but the audience, as they told me later, understood the potential of it and was eager for more, which is good.

I used this data set, which, in order to show some text transformation, has the names of the items flipped with a comma (,).
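For anyone who wants to see the logic of that text transformation outside OpenRefine, here is a rough Python sketch of un-flipping a "Last, First" style name. In OpenRefine itself this would be a cell transform; the function name and the sample values are hypothetical and are not taken from the actual data set.

```python
def unflip(name):
    """Turn 'Last, First' into 'First Last'; leave comma-free names alone."""
    last, sep, first = name.partition(",")
    if not sep:
        # No comma found, so there is nothing to flip.
        return name.strip()
    return f"{first.strip()} {last.strip()}"


# Hypothetical values, only to show the behaviour.
flipped = ["Item, Example", "Name, Another", "Unchanged"]
fixed = [unflip(n) for n in flipped]
print(fixed)
```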

The benefit of using the sandbox items is that you don’t have to worry about the changes you make to them.

ostephens · November 29, 2022, 9:59am

The exercise that I use for reconciliation is to take a list of place names extracted from some library catalogue ‘place of publication’ data and reconcile them against Cities in Wikidata. Once the initial reconciliation is done, I get the participants to check they are happy with the matches and then use the ability to add a column from the reconciled values to add the country in which the place is situated in a new column.
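For those curious about what happens behind the scenes, this is a sketch of the kind of query batch a reconciliation client sends for a list of place names. It assumes the batch format of the Reconciliation Service API (a `queries` form parameter holding JSON keyed `q0`, `q1`, ...) and assumes Q515 as the Wikidata type for "city". The place names are examples, not the workshop data, and nothing is sent over the network here.

```python
import json


def build_queries(places, type_id="Q515", limit=3):
    """Build one reconciliation query per place name, keyed q0, q1, ..."""
    return {
        f"q{i}": {"query": place, "type": type_id, "limit": limit}
        for i, place in enumerate(places)
    }


# Example place names (assumed, for illustration only).
places = ["London", "Oxford", "Edinburgh"]
queries = build_queries(places)

# Assumption: the service expects the batch as a JSON string in a form field.
payload = {"queries": json.dumps(queries)}
print(payload["queries"])
```

OpenRefine builds and sends these batches itself, so participants never need to write this; it is only meant to show what "reconcile against Cities" amounts to.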

Because of the way my exercises are structured, participants may end up with a different list of places to reconcile each time, but since the data is extracted from the place of publication in library catalogue data, it’s pretty much guaranteed to have some cities in it (the data I start from is mostly, but not exclusively, books published in the UK).

The advantage of using Wikidata is its breadth and that it supports the full range of reconciliation functionality. I don’t currently regularly teach updating Wikidata after having done the reconciliation (I’d say that’s more a Wikidata-specific thing than a reconciliation thing) but agree with Robert that if you want to teach that, then sandbox items are probably the way to go.

If you are interested, the materials I use to teach reconciliation are:

Part 1: In these exercises the participants use an API (not reconciliation) to extract place names from the library data: “Self-paced exercises 1: Variables, operators, and getting data from online sources” (Google Docs)
Part 2: In these exercises the participants reconcile the place names against Wikidata and bring in additional data from Wikidata: “Self-paced exercises 2: Reconciliation and looking up data from other OpenRefine projects” (Google Docs)

jfaurel · November 29, 2022, 2:45pm

I’d say that for me, a good dataset is one that gives the most teaching bang for your data buck. So I usually like a dataset that is, say, 90% easy and 10% harder stuff to fix that still reflects very common issues. In forestry/biology, I’d always look for messed-up dates, typos in words, and homonyms (we use Latin words and some field assistants can be very creative).

Since I’m not too sure what issues most people will encounter while trying to reconcile data, I can’t be very specific in my request. The only thing I tried was to reconcile using the boroughs in my bed bug dataset and it didn’t give me much to play with so I figured there has to be some more exciting way to use reconciliation.

jfaurel · November 29, 2022, 2:49pm

Okay, thank you! I’ll have a look into it. Thank you Robert as well for the sandbox idea. I’ve never heard of Wikidata sandbox items so I’ll do some reading!
