Participants shared examples of workflows where OpenRefine is used as part of larger data pipelines.
Julie: HPC resource allocation workflow
Julie presented a workflow for preparing data for HPC resource allocation.
The process involves a data-cleansing step between the XLS files and an allocation script. OpenRefine is used to review and clean the data visually before running the script.
Although the workflow could be implemented entirely in R, OpenRefine is kept in the pipeline because:
- It allows detailed visual inspection of the data
- Small adjustments are needed each year
- Keeping the cleaning step in OpenRefine is simpler than moving everything to R.
Uschi: Library data migration workflow
Uschi uses OpenRefine to convert library data from a legacy system into a parent library system.
The original data is loaded into OpenRefine to identify and review errors before sending corrections back to the source libraries.
Typical tasks include:
- fixing name formatting
- identifying shelf mark issues
- detecting encoding problems
OpenRefine is mainly used as a discovery tool in this workflow:
- using facets and filters
- testing regex filters to detect encoding issues
Edits are not made directly in OpenRefine; instead, issues are reported so they can be corrected in the original systems.
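The exact regex filters Uschi uses were not captured in the notes, but the kind of filter one might type into an OpenRefine text filter to surface mojibake can be sketched in Python (the pattern below is an illustrative assumption, matching the artifacts produced when UTF-8 text is mis-decoded as Latin-1):

```python
import re

# Typical mojibake artifacts: UTF-8 bytes decoded as Latin-1 yield
# sequences such as "Ã©" (for é) or "â€™" (for ’), and failed decodes
# leave the replacement character "�".
MOJIBAKE = re.compile(r"Ã[\x80-\xBF]|â€|\ufffd")

def looks_garbled(value: str) -> bool:
    """Flag cell values that probably suffered an encoding mix-up."""
    return bool(MOJIBAKE.search(value))
```

In OpenRefine the same pattern could be pasted into a text filter with "regular expression" enabled, then combined with facets to review the affected rows.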
Jan: Phone number formatting
Jan demonstrated a workflow for formatting phone numbers before publishing data.
This involved developing a regex transform expression in OpenRefine.
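The actual expression Jan developed was not recorded; a minimal Python sketch of this kind of transform (hypothetical rules, assuming a German default country code) might look like:

```python
import re

def normalize_phone(raw: str, country_code: str = "+49") -> str:
    """Normalize a phone number to a uniform international format.

    Illustrative rules only -- not the expression shown in the session:
    strip punctuation, rewrite a leading 00 as +, and replace a leading
    national 0 with the given country code.
    """
    digits = re.sub(r"[^\d+]", "", raw)    # keep digits and a leading +
    digits = re.sub(r"^00", "+", digits)   # 0049... -> +49...
    if digits.startswith("0"):             # national -> international
        digits = country_code + digits[1:]
    return digits
```

In OpenRefine the equivalent would be a GREL "Transform…" on the phone column using `replace()` with regex patterns.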
The discussion also touched on the OpenRefine Recipes page, which collects example expressions and workflows, e.g. "Create Wikitext for Wikimedia Commons uploads".
A question was raised about how the recipes page should evolve and how large it should become; see related discussions:
Srihari: Web scraping and Wikimedia uploads
Srihari presented the following workflow: web scraping → local database → OpenRefine → Wikimedia upload.
Data sources include repositories and public websites such as:
- Flickr
- US Navy
- US Army Corps of Engineers
- EUR-Lex
- University of Texas Libraries
OpenRefine is used to prepare the data before upload.
One example mentioned was handling ambiguous or incorrect metadata, such as incorrect license information on Flickr.
The pipeline uses n8n.io for orchestration.
It was also noted that OpenRefine offers many clustering and cleanup features that could potentially be useful in other tools if they were accessible through an API.
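To make the point concrete, OpenRefine's key-collision clustering with the fingerprint keying method can be approximated in a few lines of Python (a simplified sketch; the real implementation also normalizes accents and control characters):

```python
import re
from collections import defaultdict

def fingerprint(value: str) -> str:
    """Fingerprint key in the style of OpenRefine: lowercase, strip
    punctuation, then sort and deduplicate whitespace-separated tokens."""
    cleaned = re.sub(r"[^\w\s]", "", value.strip().lower())
    return " ".join(sorted(set(cleaned.split())))

def cluster(values):
    """Key-collision clustering: group values whose fingerprints match."""
    groups = defaultdict(list)
    for v in values:
        groups[fingerprint(v)].append(v)
    return [group for group in groups.values() if len(group) > 1]
```

Exposing operations like this through an API is what would let other tools in a pipeline reuse them without a manual OpenRefine session.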
Benjamin: Archival and research workflows
Benjamin presented several workflows used in archival and research contexts:
- OCR → data review → OpenRefine → creation of structured data → publication
- NER → reconciliation → enrichment as linked data
- Manual data collection → cleaning / deduplication → reconciliation → publication
Related blog posts describing these workflows and projects: