As part of the reproducibility project, @zoecooper and I are working on improving the current Extract / Apply dialogs used to replay operations. We have identified two broad categories of use cases for this functionality and would like to open up a discussion about which ones to focus on.
The scenarios
1. Replaying entire project histories on new versions of the same dataset
In this scenario, the user wants to re-apply all the steps they have used on a project (likely including the import and export steps) to a new version of the same dataset. This new version likely has the same set of columns, probably in the same order.
For instance, the data cleaning project might consist of downloading a CSV from an open data portal and converting it to RDF via the RDF Transform extension. This involves a series of normalization and reconciliation steps. When a new version of the dataset is published, the user might want to re-run the exact same steps to obtain an updated RDF dataset.
This assumes that the user has taken care to use reproducible operations as part of their data cleaning process (editing all cells matching a particular criterion rather than manually editing single cells, for instance).
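To make the distinction concrete, here is a minimal Python sketch (not OpenRefine code; the rows, the column name and both helper functions are hypothetical) of what happens when a criterion-based edit and a single-cell edit are replayed on a new version of a dataset whose rows have moved around:

```python
# Hypothetical new version of a dataset: the rows are in a different order
# than in the project where the cleaning steps were originally recorded.
new_version = [{"city": "Paris"}, {"city": "NYC"}, {"city": "NYC"}]


def edit_matching(rows, column, old, new):
    """Criterion-based edit: rewrite every cell whose value equals `old`."""
    return [{**row, column: new} if row[column] == old else dict(row) for row in rows]


def edit_single_cell(rows, row_index, column, new):
    """Position-based edit: rewrite one cell identified only by its row index."""
    out = [dict(row) for row in rows]
    out[row_index][column] = new
    return out


# The rule still does the right thing on the new version.
by_rule = edit_matching(new_version, "city", "NYC", "New York")

# The manual edit was recorded against row 0, which used to hold "NYC";
# replayed blindly, it now overwrites "Paris" and leaves both "NYC" cells alone.
by_position = edit_single_cell(new_version, 0, "city", "New York")
```

The rule-based edit carries its own criterion with it, which is what makes it safe to replay; the positional edit only makes sense against the exact rows it was recorded on.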
2. Extracting reusable recipes (or macros) for reuse in other contexts
In this scenario, the user is interested in extracting only a few steps out of their overall workflow, because they form a coherent unit that they want to reuse themselves later on a different project, or share with others.
For instance, the user might have figured out a series of four operations that reliably transforms dates from a specific JSON format into the one expected by the Wikibase integration. They often deal with datasets that format dates this way, so they would like to save this as a custom operation that they can easily trigger from the UI without having to do all four steps manually.
Ideally, they could share this recipe with others by exporting it to a file or sharing it on an online portal.
The visions behind the scenarios
For the first scenario, we would be going more in the automation direction, turning OpenRefine more into a pipeline runner. I would say this is the classical understanding of "reproducibility" in an academic context. The current interactive way to do data cleaning in OpenRefine would become a way to implicitly design cleaning pipelines that could then be run by themselves, without having a human in the loop. A natural follow-up would be to offer running those pipelines without going through the web UI at all, for instance via a command-line tool or library. But we would still aim to have this functionality available from the web UI to make it more accessible.
For the second scenario, we would be enabling users to take more control over the pre-defined operations available in the column menus and augment them with domain-specific ones that matter to them. In a sense, it would enable them to go beyond saving GREL expressions as favorites, by saving whole sequences of operations as favorites instead. To some extent, this could perhaps enable users to do more things without typing any GREL code, by training them to combine basic operations instead. For instance, instead of coming up with GREL expressions to parse JSON and extract subfields out of a payload, one could imagine an "explode JSON" operation that would turn a column of JSON objects into a set of columns (like the JSON importer does), on which the user could then apply further transformations (removing the columns they don't care about and transforming the values they have obtained). If saving series of operations for later reuse is easy and such basic operations are available as alternatives to GREL functions, then this could become a viable alternative to learning GREL.
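To give an idea of what such an "explode JSON" operation could do, here is a minimal Python sketch. It is only an illustration of the idea, not a proposal for the actual implementation: the function name, the `column.key` naming of the new columns and the sample rows are all assumptions, and it only handles cells containing a top-level JSON object.

```python
import json


def explode_json(rows, column):
    """Replace `column` (JSON object strings) with one column per top-level key."""
    # Blank cells are treated as empty objects (an assumption of this sketch).
    parsed = [json.loads(row[column]) if row[column] else {} for row in rows]

    # Collect keys in first-seen order so the new columns come out in a stable order.
    keys = []
    for obj in parsed:
        for key in obj:
            if key not in keys:
                keys.append(key)

    result = []
    for row, obj in zip(rows, parsed):
        new_row = {name: value for name, value in row.items() if name != column}
        for key in keys:
            # Objects lacking a key get an empty (None) cell in that column.
            new_row[f"{column}.{key}"] = obj.get(key)
        result.append(new_row)
    return result


rows = [
    {"id": 1, "payload": '{"name": "Ada", "born": 1815}'},
    {"id": 2, "payload": '{"name": "Alan"}'},
]
exploded = explode_json(rows, "payload")
# exploded[0] == {"id": 1, "payload.name": "Ada", "payload.born": 1815}
# exploded[1] == {"id": 2, "payload.name": "Alan", "payload.born": None}
```

From there, the user would remove or transform the resulting columns with the ordinary column operations, without writing any expression to parse the payload.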
The features required to enable them
For the first scenario, the main features that are needed are:
- integrating the import / export steps into the project history
- providing more (or clearer) stability guarantees about the behaviour of each step of a pipeline, with some versioning
For the second scenario:
- the ability to prompt the user for specific parameters in a recipe, such as the columns to apply it to
- making saved recipes easily accessible (for instance from the current column menus, or in an alternative to them)
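As a rough sketch of what prompting for parameters could involve, the Python snippet below stores a recipe with placeholders and fills them in once the user has chosen a column. The field names are loosely modelled on the JSON produced by the Extract dialog and may not match it exactly; the `{{name}}` placeholder syntax and both helper functions are invented for this illustration.

```python
import json
import re

PLACEHOLDER = re.compile(r"\{\{(\w+)\}\}")

# Hypothetical saved recipe: the column to act on is left as a parameter.
RECIPE = [
    {
        "op": "core/text-transform",
        "columnName": "{{date_column}}",
        "expression": "grel:value.trim()",
    },
    {
        "op": "core/column-rename",
        "oldColumnName": "{{date_column}}",
        "newColumnName": "{{date_column}} (normalized)",
    },
]


def placeholders(recipe):
    """Return the names of the parameters the user has to be prompted for."""
    return sorted(set(PLACEHOLDER.findall(json.dumps(recipe))))


def instantiate(recipe, params):
    """Return a copy of the recipe with every placeholder replaced by its value."""
    missing = [name for name in placeholders(recipe) if name not in params]
    if missing:
        raise KeyError(f"missing recipe parameters: {missing}")

    def fill(value):
        if isinstance(value, str):
            return PLACEHOLDER.sub(lambda m: params[m.group(1)], value)
        if isinstance(value, list):
            return [fill(item) for item in value]
        if isinstance(value, dict):
            return {key: fill(item) for key, item in value.items()}
        return value

    return fill(recipe)


to_prompt = placeholders(RECIPE)  # ["date_column"]
concrete = instantiate(RECIPE, {"date_column": "Date of birth"})
```

The list returned by `placeholders` is what the UI would turn into a small form (here, a single column picker) before applying the recipe.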
For both:
- better error handling when some steps of a recipe fail to apply
- visualization of a recipe (with a graph-based representation instead of a bare JSON structure)
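One possible direction for the error handling point is to check a recipe against the target project before applying anything, so that the user is told up front which steps cannot run. The Python sketch below assumes a hypothetical step format in which each step declares the columns it requires, adds and removes; OpenRefine operations do not currently expose this information in that form.

```python
def check_recipe(steps, columns):
    """Return (step index, missing columns) pairs without applying anything."""
    available = set(columns)
    problems = []
    for index, step in enumerate(steps):
        missing = [c for c in step.get("requires", []) if c not in available]
        if missing:
            problems.append((index, missing))
        # Optimistically assume the step ran, so that later steps are still
        # checked against the columns it would have produced or removed.
        available -= set(step.get("removes", []))
        available |= set(step.get("adds", []))
    return problems


# Hypothetical recipe and target project.
steps = [
    {"description": "Trim date", "requires": ["date"]},
    {"description": "Split date", "requires": ["date"], "adds": ["year", "month"]},
    {"description": "Reconcile place", "requires": ["place"]},
    {"description": "Remove month", "requires": ["month"], "removes": ["month"]},
]
problems = check_recipe(steps, ["date", "name"])
# problems == [(2, ["place"])]: only the reconciliation step cannot be applied
```

The same declared dependencies between steps and columns would also be a natural input for a graph-based visualization of the recipe.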
To me, both of those scenarios naturally fit in the reproducibility project and would be in scope for OpenRefine, but because they correspond to quite different goals, I think it's worth pondering which one to prioritize. @zoecooper has been gathering feedback from community members on this topic, but I think it's also worth having an open discussion here.