OpenRefine 2024 Barcamp: If only OpenRefine could be more like

This session was a roundtable where participants presented other software from which OpenRefine should take inspiration. See the Barcamp page.

Data Wrangler Extension (for Visual Studio Code)

Visual data cleaning in the IDE for Python, bringing the "What you see is what you get" aspect of OpenRefine to developers. It comes with many reproducibility guarantees since it's based on Python.
@thadguidry: Can be overwhelming even for quite technical folks.

Dataiku - AI data prep

In 2014-2015, various start-ups raised funds for ideas similar to OpenRefine (see also Trifacta below). It's interesting to see where they went.
Dataiku uses generative AI to replace GREL and other expression languages. Instead of writing an expression, you write in plain text what you want to do and generative AI takes care of the translation.
All steps are recorded in the history, similar to OpenRefine (likely a big inspiration).
@thadguidry: Makes working with data easy, matches well with the goals for OpenRefine.
@ostephens: How well does it deal with more complicated tasks?
@ostephens: Is the good thing the natural language parsing into expressions, or is there more clever processing behind it?
Many non-technical users struggle with generating even simple GREL expressions. I feel like the natural language parsing for instructions is the most important part here.

Trifacta and Alteryx have also been mentioned in relation to Dataiku as enterprise data preparation tools. Specifically, visualization of column data overlaps with Ydata profiling mentioned below. They offer immediate feedback to users about their data without further actions.
Trifacta is the same technology as Google Data Prep. It is more opinionated than OpenRefine, highlighting potential issues with the data (datatype outliers…).

QGIS

Ways to easily manipulate and visualize spatial data and shapefiles.
Many users work with shapefiles, where a "shapefile" is a set of files that work together to produce a visual output.
QUESTION: Aren't shapefiles slowly being abandoned by the industry in favor of GeoJSON?
The transition will take a long while since many actors have loads of historical data as shapefiles.
How do we join multiple shapefiles together? This is an ongoing request in OpenRefine to add data to an existing project (See #715). The limitation with OpenRefine is that you have only one go to create your project. (Similar point for LibreOffice Calc below.)

LibreOffice Calc

Spreadsheet software with CLI capabilities, for combining files for example.
Pre-preparation for OpenRefine. When you have many files with similar data (e.g., 21 counties or 290 municipalities), they cannot be merged in OpenRefine after the initial project creation, so the merge must happen in a pre-preparation step.
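
The pre-preparation step can also be scripted. A minimal sketch in Python, assuming the files are non-empty CSVs that share the same header row. The function name `merge_csv_files` and the extra `source_file` column are illustrative assumptions, not something OpenRefine or LibreOffice provides:

```python
import csv
from pathlib import Path


def merge_csv_files(paths, output_path):
    """Concatenate CSV files that share a header into one file.

    Adds a 'source_file' column (an assumption for illustration) so each
    row can still be traced back to its original file after import.
    Returns the number of data rows written.
    """
    rows_written = 0
    writer = None
    with open(output_path, "w", newline="", encoding="utf-8") as out:
        for path in paths:
            with open(path, newline="", encoding="utf-8") as src:
                reader = csv.DictReader(src)
                if writer is None:
                    # Take the column order from the first file.
                    fieldnames = list(reader.fieldnames) + ["source_file"]
                    writer = csv.DictWriter(out, fieldnames=fieldnames)
                    writer.writeheader()
                for row in reader:
                    # A column missing from a later file is written as blank;
                    # an unexpected extra column raises ValueError.
                    row["source_file"] = Path(path).name
                    writer.writerow(row)
                    rows_written += 1
    return rows_written
```

The merged file can then be imported as a single OpenRefine project, and the `source_file` column can be faceted to check that every input file made it in.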

Microsoft Excel

Office 365 now supports inline Python, which is bringing people back to Excel. However, this feature requires the cloud, so there are some privacy concerns. It is worth noting that MS PowerTools adds OpenRefine-like functionality to Excel.

Postman

Tool for interacting with web APIs.
@thadguidry: I love the idea of making our Fetch have many of the features of Postman.
This would also help even more with JSON-LD and linked data in general.
The "Fetch URLs" operation could be extended (or this could be a separate extension).
Relates to OpenRefine's "Add column by fetching URLs", JSON parsing, and importing data from a remote data source (fetching).
When you make a request, you can set variables.
You can import a CSV of data and use it when calling an API; Postman stores the CSV.
Useful for users using OR as a web scraping tool.
But a very technical tool; you need to get used to another scripting language.
Supports recipe sharing.
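
The "CSV of data plus request variables" idea can be sketched in a few lines of Python. This is only an illustration of the pattern, not how Postman or OpenRefine implement it: the function `build_request_urls`, the endpoint, and the sample data are all made up, and placeholders use Python's `{name}` format syntax rather than Postman's own variable syntax. No request is actually sent.

```python
import csv
import io
from urllib.parse import quote


def build_request_urls(url_template, csv_text):
    """Fill {placeholders} in url_template from each CSV row.

    Values are percent-encoded so they are safe inside a URL.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    urls = []
    for row in reader:
        encoded = {key: quote(value, safe="") for key, value in row.items()}
        urls.append(url_template.format(**encoded))
    return urls


# Hypothetical endpoint and data, for illustration only.
template = "https://api.example.org/search?q={name}&country={country}"
data = "name,country\nMarie Curie,PL\nAda Lovelace,GB\n"
for url in build_request_urls(template, data):
    print(url)
```

In OpenRefine terms, each CSV row plays the role of a project row, and the template plays the role of the expression that builds the URL to fetch.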

Open Data Editor

Grant-supported.
Targets governments who want to publish and share data.
OKFN will be at Wikimania.

Zulip

Great contributor documentation as an example.
A PDF version of the documentation is useful (@jfaurel); see ticket #275.
@thadguidry: I like the categorization; we can definitely take inspiration from those sub-categories.
@ostephens: Would like to see evidence that the documentation is actually helping.

Protege

Open source project used in the context of the semantic web.
Standalone and web-based application, supports extensions. Built in Java - can give us good ideas on how to approach the extension plugin.

Ydata profiling

Quick reporting about a dataset's qualities. Very interesting for data exploration.
Python library used with pandas DataFrames.
Import a table and print a report with basic features of the data.
It is like generating facets for all fields in a project automatically and presenting the results in an HTML/PDF report, including visualizations fitting the data type of the facet.
@thadguidry: I've been using Polars more and more.
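
The "facets for all fields" comparison can be made concrete with a small standard-library sketch. It does not use ydata-profiling or reproduce its report; the function `profile_columns` and the sample data are assumptions for illustration, and the output is roughly what a text facet per column would show (distinct values, blanks, most common values).

```python
import csv
import io
from collections import Counter


def profile_columns(csv_text, top_n=3):
    """Return, per column, the distinct count, blank count and top values."""
    reader = csv.DictReader(io.StringIO(csv_text))
    counters = {name: Counter() for name in reader.fieldnames}
    for row in reader:
        for name in counters:
            # Treat missing cells as blank strings.
            counters[name][row[name] or ""] += 1
    report = {}
    for name, counter in counters.items():
        report[name] = {
            "distinct": len(counter),
            "blank": counter[""],
            "top": counter.most_common(top_n),
        }
    return report


# Made-up sample data.
sample = "city,type\nParis,capital\nLyon,\nParis,capital\n"
for column, stats in profile_columns(sample).items():
    print(column, stats)
```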

MessyDesk

Visual workflow editor to visualize the pipeline.

Jupyter Notebook

Example of literate programming.
Supports different kernels (Python, Julia, R, ...).
Combines text (Markdown) with code and its output.
A better collaboration environment might be Colab, but there is also JupyterHub.

Text editor, IDE, and command line tools

Sometimes these are faster than OpenRefine for sorting, de-duplicating, and editing smaller lists or data sets. It's really difficult to do really simple things in OpenRefine, like deduplicating a list.
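
For comparison, deduplicating a list outside OpenRefine takes only a few lines. A minimal sketch in Python, assuming that values differing only in surrounding whitespace count as duplicates (the `normalize` parameter is an illustrative choice, not an OpenRefine concept):

```python
def deduplicate(values, normalize=str.strip):
    """Remove duplicates while keeping the first occurrence and the order."""
    seen = set()
    unique = []
    for value in values:
        key = normalize(value)
        if key not in seen:
            seen.add(key)
            unique.append(value)
    return unique


print(deduplicate(["Paris", "Lyon", "Paris ", "Nice", "Lyon"]))
# ['Paris', 'Lyon', 'Nice']
```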

Datasette

Focus on visualizing and exploring data.
Author comes from a data journalism background, and examples are often in this area.
Example: Global Power Plants Dataset.

Tabula

Extract tabular data from PDFs.
Last release 2018.
It would be nice if OpenRefine supported PDF as an import format and tried extracting tables from it.

Antelope

  • Proposed by: @lozanaross
  • URL: Antelope and Antelope Service.
    Connect data about entities to multiple ontology/vocab sources and let the user pick the right one (with some ML-automation for the suggestions as well).
    Looking to have an Antelope extension for OpenRefine; first working on a Wikibase extension.

TS4NFDI

The idea behind this software is to reconcile against different vocabularies/systems at the same time to find all matching identifiers, rather than only the one in GND XOR Wikidata XOR ...
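
The difference between "one source XOR another" and "all sources at once" can be sketched as follows. This is not TS4NFDI's API or an OpenRefine reconciliation service: the function `reconcile_all`, the in-memory vocabularies, and the placeholder identifiers are all invented for illustration, and matching is a naive exact match on a normalized label.

```python
def reconcile_all(label, vocabularies):
    """Look up a label in every vocabulary and collect all identifiers found."""
    key = label.strip().casefold()
    matches = {}
    for source, entries in vocabularies.items():
        identifier = entries.get(key)
        if identifier is not None:
            matches[source] = identifier
    return matches


# Made-up vocabularies with placeholder identifiers, for illustration only.
vocabularies = {
    "GND": {"example person": "gnd-placeholder-1"},
    "Wikidata": {
        "example person": "wd-placeholder-1",
        "other entity": "wd-placeholder-2",
    },
}
print(reconcile_all("Example Person", vocabularies))
# {'GND': 'gnd-placeholder-1', 'Wikidata': 'wd-placeholder-1'}
```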