Which reproducibility should we focus on?

As part of the reproducibility project, @zoecooper and I are working on improving on the current Extract / Apply dialogs used to replay operations. We have identified two broad categories of use cases for this functionality and would like to open up the discussion about which one to focus on.

The scenarios

1. Replaying entire project histories on new versions of the same dataset

In this scenario, the user wants to re-apply all steps they have used on a project (likely including the import and export steps) to a new version of the same dataset. This new version likely has the same set of columns, probably in the same order.

For instance, the data cleaning project might consist of downloading a CSV from an open data portal and converting it to RDF via the RDF Transform extension. This involves a series of normalization and reconciliation steps. When a new version of the dataset is published, the user might want to re-run the exact same steps to obtain an updated RDF dataset.

This assumes that the user has taken care to use reproducible operations as part of their data cleaning process (editing all cells matching a particular criterion rather than manually editing single cells, for instance).
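To make this scenario a bit more concrete, here is a minimal Python sketch of replaying a saved operation list on a new version of a dataset. It is only an illustration, not how OpenRefine applies operations: the `replay` function, the two handlers and the sample data are hypothetical, and the operation entries are only loosely modelled on the JSON that the Extract dialog produces.

```python
import csv
import io
import json

# Operation list loosely modelled on the JSON produced by the Extract dialog.
HISTORY = json.loads("""
[
  {"op": "core/column-rename", "oldColumnName": "nom", "newColumnName": "name"},
  {"op": "core/column-removal", "columnName": "internal_id"}
]
""")


def rename_column(rows, op):
    """Hypothetical handler: rename one column in every row."""
    old, new = op["oldColumnName"], op["newColumnName"]
    return [{(new if key == old else key): value for key, value in row.items()}
            for row in rows]


def remove_column(rows, op):
    """Hypothetical handler: drop one column from every row."""
    return [{key: value for key, value in row.items() if key != op["columnName"]}
            for row in rows]


# Toy registry covering only the two operations used above.
HANDLERS = {
    "core/column-rename": rename_column,
    "core/column-removal": remove_column,
}


def replay(history, rows):
    """Apply every recorded step, in order, to a fresh set of rows."""
    for op in history:
        rows = HANDLERS[op["op"]](rows, op)
    return rows


# A "new version" of the dataset with the same columns as the original.
NEW_VERSION = "nom,internal_id,city\nAda,17,Paris\nLin,42,Lyon\n"
rows = list(csv.DictReader(io.StringIO(NEW_VERSION)))
result = replay(HISTORY, rows)
print(result)
```

The sketch only works because both steps are expressed in terms of column names rather than individual cells, which is the kind of reproducible operation this scenario relies on.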

2. Extracting reusable recipes (or macros) for reuse in other contexts

In this scenario, the user is interested in extracting only a few steps out of their overall workflow, because they form a coherent unit that they want to reuse themselves later on a different project, or share with others.

For instance, the user might have figured out a series of four operations that reliably transform dates stored in a specific JSON format into the format expected by the Wikibase integration. They often deal with datasets that format dates in this way, so they would like to save this as a custom operation that they can easily trigger from the UI without having to do all four steps manually.

Ideally, they could share this recipe with others by exporting it to a file or sharing it on an online portal.

The visions behind the scenarios

For the first scenario, we would be going more in the automation direction, turning OpenRefine more into a pipeline runner. I would say this is the classical understanding of "reproducibility" in an academic context. The current interactive way to do data cleaning in OpenRefine would become a way to implicitly design cleaning pipelines that could then be run by themselves, without having a human in the loop. A natural follow-up would be to offer running those pipelines without going through the web UI at all, for instance via a command-line tool or library. But we would still aim to have this functionality available from the web UI to make it more accessible.

For the second scenario, we would be enabling users to take more control over the pre-defined operations available in the column menus and augment them with domain-specific ones that matter to them. In a sense, it would enable them to go beyond saving GREL expressions as favorites, by instead saving whole operation sequences as such. To some extent, this could perhaps enable users to do more things without typing any GREL code, by training them to combine basic operations instead. For instance, instead of coming up with GREL expressions to parse JSON and extract subfields out of a payload, one could imagine an "explode JSON" operation that would turn a column of JSON objects into a set of columns (like the JSON importer does), on which the user could then apply further transformations (removing the columns they don't care about and transforming the values they have obtained). If saving series of operations for later reuse is easy and such basic operations are available as alternatives to GREL functions, then this could become a viable alternative to learning GREL.

The features required to enable them

For the first scenario, the main features that are needed are:

  • integrating the import / export steps into the project history
  • providing more (or clearer) stability guarantees about the behaviour of each step of a pipeline, with some versioning

For the second scenario:

  • the ability to prompt the user for specific parameters in a recipe, such as the columns to apply it to
  • making saved recipes easily accessible (for instance from the current column menus, or in an alternative to them)
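As a rough illustration of the first point, a recipe could declare the parameters it needs and mark where the user's answers go. The sketch below is purely hypothetical: the recipe layout, the `parameters` key, the `{{column}}` placeholder syntax and the `instantiate` function are all assumptions made for this example, not an existing OpenRefine format.

```python
import json

# Hypothetical recipe format: "parameters" lists what to ask the user for,
# and "{{name}}" placeholders mark where the answers are substituted.
RECIPE = {
    "name": "Normalize dates",
    "parameters": ["column"],
    "operations": [
        {"op": "core/text-transform", "columnName": "{{column}}",
         "expression": "grel:value.trim()"},
        {"op": "core/text-transform", "columnName": "{{column}}",
         "expression": "grel:value.toDate()"},
    ],
}


def instantiate(recipe, answers):
    """Return a copy of the recipe's operations with placeholders filled in."""
    missing = [name for name in recipe["parameters"] if name not in answers]
    if missing:
        raise ValueError(f"missing recipe parameters: {missing}")

    def fill(value):
        # Walk the JSON structure and substitute placeholders in every string.
        if isinstance(value, str):
            for name in recipe["parameters"]:
                value = value.replace("{{" + name + "}}", answers[name])
            return value
        if isinstance(value, dict):
            return {key: fill(item) for key, item in value.items()}
        if isinstance(value, list):
            return [fill(item) for item in value]
        return value

    return fill(recipe["operations"])


operations = instantiate(RECIPE, {"column": "birth_date"})
print(json.dumps(operations, indent=2))
```

In a UI, the `answers` dictionary would come from a dialog asking the user which column to apply the recipe to; the stored recipe itself is left untouched.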

For both:

  • better error handling when some steps of a recipe fail to apply
  • visualization of a recipe (with a graph-based representation instead of a bare JSON structure)
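For the error handling point, one possible direction is a dry run that checks every step against the columns of the target project and reports all problems at once, instead of stopping at the first failure. The following Python sketch assumes a simplified model in which each step depends on at most one column; `StepResult` and `dry_run` are hypothetical names, and the operation entries are only loosely modelled on the extracted JSON.

```python
from dataclasses import dataclass


@dataclass
class StepResult:
    index: int        # 1-based position of the step in the recipe
    op: str           # operation identifier, e.g. "core/column-rename"
    ok: bool
    message: str = ""


def dry_run(history, columns):
    """Check each step against the available columns without touching data."""
    available = set(columns)
    report = []
    for index, op in enumerate(history, start=1):
        kind = op["op"]
        if kind == "core/column-rename":
            required = op["oldColumnName"]
        else:
            required = op.get("columnName")
        if required is not None and required not in available:
            report.append(StepResult(index, kind, False,
                                     f"column '{required}' not found"))
            continue  # keep checking later steps instead of aborting
        # Track how the step changes the set of columns for later steps.
        if kind == "core/column-rename":
            available.discard(op["oldColumnName"])
            available.add(op["newColumnName"])
        elif kind == "core/column-removal":
            available.discard(op["columnName"])
        report.append(StepResult(index, kind, True))
    return report


history = [
    {"op": "core/column-rename", "oldColumnName": "nom", "newColumnName": "name"},
    {"op": "core/text-transform", "columnName": "date",
     "expression": "grel:value.toDate()"},
    {"op": "core/column-removal", "columnName": "internal_id"},
]
report = dry_run(history, ["nom", "city", "internal_id"])
for step in report:
    status = "ok" if step.ok else f"FAILED: {step.message}"
    print(f"step {step.index} ({step.op}): {status}")
```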

To me, both of those scenarios naturally fit in the reproducibility project and would be in scope for OpenRefine, but because they correspond to quite different goals I think it's worth pondering which one to prioritize. @zoecooper has been gathering feedback from community members on this topic, but I think it's also worth having an open discussion here.

Scenario 2:

  1. I think this is more or less my long-asked-for issue #109 about Macros.

  2. I also agree that the idea then leads to users wanting to share their Recipes or Macros (I cannot find the issue, @Martin?) with their groups or domain orgs, loading and saving their Recipes in centralized Recipe repos online. Perhaps OpenRefine itself (or some organization, or a GitHub repo) could host a public repo that OpenRefine clients could load Recipes from and save them to, as an opt-in.

I would think that Scenario 2 applies to more users worldwide. In my experience, OpenRefine itself grew from the ground up out of some bits of ETL design and their disadvantages: ETL designs made it hard to work with ad hoc datasets, and they lacked the better visualization (no Excel Hell) and rollback needed to see which steps to put into an ETL pipeline in the first place, as in Scenario 1.

Scenario 1:

I would think that Scenario 1 applies to fewer users. It's essentially what many existing tools already offer (Jupyter Notebooks, ETL, etc.). I think having Scenario 1 will naturally lead to SOME users needing to pipeline lots of Scenario 2 Macros/Recipes together to reproduce across datasets. So Scenario 2 seems like the first priority to me anyway.

Anyway, that's how I see things and what I have heard from the groups I've taught: perform and visualize small sets of transforms, then use them in a larger pipeline. So it will be good to hear or see how many users actually want stronger support for pipeline operations. I haven't seen or heard from many.

(Back in 2012, Issue #109 came right after an R-lang meeting where we discussed its powerful macro/function ecosystem, the Comprehensive R Archive Network, and folks looking to plug into OpenRefine's existing transforms.)

Thanks for starting this discussion. I think it's important to be explicit about what the requirements are on this front. Who are the target users for these scenarios? I think that's important context for the discussion.

While scenario 2 sounds useful, it sounds more like an example of modularity/reusability than reproducibility to me. I suspect many users of this scenario will also want to parameterize and/or edit these saved "macros," which may bring with it some additional requirements. JSON is uniquely unsuited for human consumption, so we may want to look at switching to YAML or some other textual representation for the DSL. I have a prototype of what a YAML solution would look like. Some of the discussion seems to assume building IDE-like capabilities into OpenRefine itself, but that seems like a heavy lift to me.

I think there are a few other useful reproducibility scenarios to consider besides the two mentioned:

0A - run a workflow on identical inputs with the identical version of OpenRefine and get identical results (or a clear error message highlighting what step failed)
0B - run a workflow on identical inputs using a later version of OpenRefine and get identical results (or a clear error message)
3 - export a comprehensive human readable workflow description which covers ALL operations

I think people expect 0A & 0B to work today, but the system isn't built to support it (and we don't test for it). It wouldn't be a heavy lift to make the error handling / reporting more robust, but it's work that needs to be done. Adding a version identifier to the saved workflows might help identify compatibility mismatches. We probably also need to track extensions which are used in the workflow and check for them when things are run again. A decision should be made about whether the operation DSL gets documented (and stabilized) for external use. People may want to try and use it as input into other processes/analyses (I think I've read of people already doing that today). I think even just supporting this level 0 use case would provide useful benefit to users (and bring the system more in line with what they already expect to be true).
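A compatibility check along these lines could be sketched as follows. Everything here is an assumption made for illustration: the `openrefineVersion` field does not exist in saved workflows today, version strings are assumed to be plain dotted numbers, the prefix of an operation identifier (the part before the `/`) is assumed to name the extension providing it, and `example-extension/example-op` is a made-up operation.

```python
def parse_version(text):
    """Turn a plain dotted version such as '3.8.2' into a comparable tuple."""
    return tuple(int(part) for part in text.split("."))


def compatibility_problems(workflow, running_version, installed_extensions):
    """List reasons why a saved workflow might not replay faithfully."""
    problems = []

    saved = workflow.get("openrefineVersion")  # hypothetical field
    if saved is None:
        problems.append("workflow has no version identifier")
    elif parse_version(saved) > parse_version(running_version):
        problems.append(
            f"workflow was saved with {saved}, newer than running {running_version}"
        )

    # Assumption: the part of the operation id before "/" names its provider.
    needed = {op["op"].split("/", 1)[0] for op in workflow["operations"]} - {"core"}
    for extension in sorted(needed - set(installed_extensions)):
        problems.append(f"extension '{extension}' is required but not installed")

    return problems


workflow = {
    "openrefineVersion": "3.8.2",
    "operations": [
        {"op": "core/text-transform"},
        {"op": "example-extension/example-op"},
    ],
}
for problem in compatibility_problems(workflow, "3.7.9", ["database"]):
    print(problem)
```

Running such a check before replaying anything would let the error message point at the mismatch itself rather than at whichever step happens to fail first.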

#3 is a human reproducibility scenario to provide data analysts and scientists with documentation of what was done in an OpenRefine project.
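As a small sketch of what such an export could look like, the snippet below turns a list of operations into a numbered plain-text description. It assumes each operation entry carries a `description` string, falls back to the operation identifier when it does not, and uses a hypothetical `describe` function; it says nothing about how operations that are not currently recorded would be covered.

```python
def describe(history):
    """Render a numbered, human-readable summary of a list of operations."""
    lines = []
    for index, op in enumerate(history, start=1):
        # Fall back to the raw identifier when no description was recorded.
        text = op.get("description") or f"{op['op']} (no description recorded)"
        lines.append(f"{index}. {text}")
    return "\n".join(lines)


history = [
    {"op": "core/column-rename", "description": "Rename column nom to name"},
    {"op": "core/column-removal"},
]
summary = describe(history)
print(summary)
```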

I see command line running of workflows as a separate capability which brings with it a number of other requirements, so even if it's a "natural followup," I think it deserves to be considered separately. It probably implies a documented, supported, and versioned API among other things. People may also have some assumptions about larger scale data or other "production" level capabilities that they associate with this. Increasing scope later rather than trying to do it all in one go seems like the best approach to me.


My apologies for the delayed update, here’s a roundup of the month:

I worked on sketching, designing, and refining the UI for the pipeline view of the operation history. This included designing how and where users may encounter the pipeline, how it complements the list view, and the interface itself.

I also designed a new set of operation logos/icons.

On the research side, I spoke to users (thanks for your time Nicolas & Thomas) and iterated on my interview questions for the next round of user interviews this coming month. I’ve also been in touch with a few people from the data journalism world (from Tactical Tech, Bloomberg, etc) to share their insights. Though the focus has been on experienced OpenRefine users, I’m hoping to expand the pool as well.

I’ve also been looking into macros as a possible design solution and storyboarding out user flows to show in interviews.

Both options will improve OpenRefine, but developing the macro is more of a customization project than an enhancement for reproducibility.

The first scenario corresponds to what was proposed in the grant application. Many users have already hacked this process together, and official support would be welcome.

Looking closer, we should carefully define the scope and what we mean by "turning OpenRefine more into a pipeline runner".

  1. If we are considering the potential for an official headless mode, I'd appreciate hearing from @felixlohmeier, given his extensive experience with openrefine-client and openrefine-batch. Felix's approach, which allows developers to integrate OpenRefine into larger scripts handling data retrieval and publishing, is particularly advanced.

  2. Are we considering orchestration capabilities, with the option to run on a schedule and send alerts on failure? In that case, the workflow orchestration space is moving fast, and many great open-source solutions are already available. I would prefer that OpenRefine integrate nicely with them rather than recreate its own version.

I like @tfmorris's approach to moving in that direction with smaller, more frequent releases. I would also include:

@thadguidry I know we discussed the option to share recipes in the past, but I cannot locate the issue, discussion, or related document. Offering this level of customization is a great idea, and it can also ease the learning curve for many users by making some complex operations available in one or two clicks. I will add what I can recollect to issue #109, "Allow Users and Extensions to customize a Custom Menu (Tools/Other) area" (OpenRefine/OpenRefine on GitHub).