Reconciliation in OpenRefine
This post merges discussions from two sessions: "Reconciliation in OpenRefine" and "Using OpenRefine with Wikidata, Wikibase, and Wikimedia Commons". In this post, the shorthand wiki* refers to Wikibase, Wikidata, and Wikimedia Commons.
Update on Recent Improvements in the Reconciliation Process
@Ayushi_Rai provided an update on recent developments, tracked in the GitHub project OpenRefine reconciliation workflow improvements:
- The reconciliation dialog has been changed, but the documentation has not been updated yet.
- Formerly, services were part of the main reconciliation dialog.
- Choosing a reconciliation service would automatically hide the selection, making it difficult to find again.
- Now, the selection is a separate step in the dialog.
- You can switch between the main reconciliation dialog and the service selection.
- The number of reconciliation results shown has been reduced to three; more can be shown on demand by clicking "see more".
- A tooltip on the column shows from which service the column has been extended.
- Icon on columns with reconciliation data to show which service is used (still in progress).
Presentation of the Workflow for Wiki* Users
@Michael_Markert shared his experiences using OpenRefine for multi-step reconciliation approaches, which can be complicated when working with several types of entities in one project.
@Alicia_Fagerving from Wikimedia Sweden described their workflow using OpenRefine for OpenGlam projects:
- The GLAM organization provides image files with a spreadsheet of metadata.
- They use OpenRefine to process the data.
- Upload files with Pattypan, now easier with Commons integration.
@HannaMeiners presented two current use cases. In both cases, OpenRefine is used to clean messy data and prepare it for reconciliation.
- Change property IDs in many Wikidata items via QuickStatements.
- Prepare GND import for theatre buildings and enrich them from existing Wikidata entries.
@lozanaross shared their heavy use of Wikibase integration for third-party instances. NFDI4Culture hosts the repository for the reconciliation service and plans to maintain it as a Wikibase extension.
OpenRefine as a Reconciliation Tool Instead of Data Cleansing
@Michael_Markert mentioned that cleaning for reconciliation is different from regular data cleaning. @Susanna_Anas noted that OpenRefine is not always the best tool for all tasks, leading to switching between programs. @lozanaross emphasized that in the Wikimedia community, OpenRefine is primarily a reconciliation and data upload tool, with data cleaning as a bonus.
Clarifying Roles in the Reconciliation Process
There is often confusion about who is responsible for different parts of the reconciliation process in OpenRefine.
@martin indicated the need to explain to OpenRefine users that OpenRefine, the reconciliation endpoints (scoring, recon speed, etc.), and the datasets are maintained by different organizations. Users often see OpenRefine as the only community responsible for everything, leading to misunderstandings.
@ostephens asked if Wikidata was at the forefront of supporting the protocol, and whether there are other services now as well. @tfmorris's understanding was that @antonin_d did this independently of Wikidata.
@antonin_d referenced the testbench for details on which services support which features. @martin commented that users often treat it as a reference point for finding out which services are available. This confuses them because the table was not designed for this purpose. @thadguidry suggested making the table header sticky.
@martin suggested that displaying more metadata about each reconciliation service in OpenRefine might help users understand that it is an external service to OpenRefine. Currently, very little is shown about what a reconciliation service is, how to get more info about it, or contact the maintainer.
Entity Reconciliation Community Group
@antonin_d introduced the Entity Reconciliation Community Group, which operates independently of OpenRefine even if there is an overlap in participants. This is a conscious decision to create a standard instead of an OpenRefine-specific protocol. Other tools implementing the protocol are listed here: Entity Reconciliation API Census.
@antonin_d presented two changes in the latest draft specification, neither of which is supported in OpenRefine yet:
- Improved accessibility, such as one endpoint supporting multiple languages.
- Improved scoring, with services able to expose multiple/more detailed scores.
@Michael_Markert asked if there are already services supporting multiple scores. @thadguidry mentioned that OCLC is considering implementing this, with a team working on the next evolution of Entity Matching, which, as far as @thadguidry recalled, is part of WorldShare.
@antonin_d invited interested contributors to join the Entity Reconciliation Community Group.
Supporting wiki* communities
Different Wikimedia groups are currently active in OpenRefine in various ways. A large user group works with wiki* using OpenRefine. Often, they mix OpenRefine and wiki* issues into the same question. Their questions cannot be answered solely within OpenRefine; they require knowledge of Wikibase, Wikimedia Commons, or Wikidata. See for example this forum post.
@Susanna_Anas noted that Wikimedia communities try to engage with OpenRefine projects and sometimes struggle. How can we support them? Can we coexist, or do they have to look elsewhere?
The group discussed options to better support them when questions go beyond OpenRefine's scope:
- Do we invite them to post across multiple channels: wiki* forums, OpenRefine forum, StackOverflow, Telegram?
- How can we attract more Wikimedia experts to the OpenRefine forum? Tag issues as wiki-related?
- Organize continuous events like hackathons with both OpenRefine and wiki* representatives. How do we fund these events?
@ostephens expressed happiness at seeing the wiki* community adopting the tool but also concern about it overshadowing other use cases, such as general data cleaning versus pipelines for data to wiki*.
Potential Improvements
Scoring and Suggestion
@tfmorris and @Alicia_Fagerving mentioned that people are disappointed if the reconciliation results for their data are not good.
@Susanna_Anas emphasized that reconciliation is the most important function when working with OpenRefine and wiki*, but it often hits the limits of the reconciliation "black box." More information during reconciliation would help users understand and improve the process. The user interface in OpenRefine lacks comprehensive ways to compare reconciliation results.
@thadguidry mentioned "standardized scoring," so a 100% score means an exact match regardless of the service used. @tfmorris noted that this conflicts with many search APIs, where scores are open-ended with no upper bound.
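The tension between these two positions can be illustrated with a minimal sketch. It assumes candidates are plain dictionaries with a numeric "score" field; the helper name rescale_scores and the sample candidates are hypothetical, and nothing here reflects how OpenRefine or any particular service computes scores. Rescaling open-ended scores relative to the best candidate in a response always puts that candidate at 100, so the number alone cannot signal an exact match.

```python
from typing import Dict, List


def rescale_scores(candidates: List[Dict], top: float = 100.0) -> List[Dict]:
    """Rescale scores relative to the best candidate in the same response.

    Hypothetical helper for illustration only: the best candidate always
    lands on `top`, whether or not it is actually an exact match.
    """
    if not candidates:
        return []
    best = max(c["score"] for c in candidates)
    if best <= 0:
        # Nothing meaningful to scale against.
        return [{**c, "score": 0.0} for c in candidates]
    return [{**c, "score": round(c["score"] / best * top, 1)} for c in candidates]


# Made-up candidates with open-ended, search-engine-style scores.
open_ended = [
    {"id": "A", "name": "Berlin State Opera", "score": 37.5},
    {"id": "B", "name": "Berlin Opera House", "score": 15.0},
]

rescaled = rescale_scores(open_ended)
for candidate in rescaled:
    print(candidate["id"], candidate["score"])
```

In this sketch a top score of 100 only means "best of this batch". A standardized scale where 100 means an exact match would need the service itself to define what an exact match is, which is the part a client-side rescaling cannot supply.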
@mack-v noted that suggestions often lack enough detail to make a decision on which suggestion best fits. It would be useful to load a list of properties for all suggestions and allow adding, removing, and ranking properties during the process.
@AtesComp noted that the RDF extension separates reconciliation and data upload.
Reconciliation Workflow
@ostephens noted that you cannot reenter a step of the workflow during reconciliation. Blank cells are not counted in the percentage, and copying reconciliation data is currently not working at all. (Ticket to be created.) @mack-v mentioned that sometimes reconciliation services stop giving results, leaving cells unreconciled. The new error reporting will address this. @Michael_Markert suggested rethinking the whole reconciliation process and UI/UX for partial reconciliation.
Reconciliation Dialog UI
@thadguidry provided some history regarding the initial plan for this feature. The design is ancient, and even the original designer recommended changes. Dialogs are too small and are not separate workflow windows, and there are insufficient user preferences to show or hide elements. There are many old mailing list threads and issues with great ideas that haven't been implemented yet. @thadguidry and David Huynh initially designed reconciliation as a separate side panel that can slide out, with three tabs: Properties, Entities, and Services. A senior designer is needed to prototype that side panel and put it in place.
@Susanna_Anas mentioned that a separate working group for this would be really useful. She opened the following conversation on our forum: Working group for reconciliation user interfaces