Which kinds of reproducibility should we focus on?

Thanks for starting this discussion. I think it's important to be explicit about the requirements here, starting with: who are the target users for these scenarios? That context should shape the rest of the discussion.

While scenario 2 sounds useful, it reads more like an example of modularity/reusability than reproducibility to me. I suspect many users of this scenario will also want to parameterize and/or edit these saved "macros," which brings some additional requirements with it. JSON is poorly suited for human reading and editing, so we may want to look at switching to YAML or some other textual representation for the DSL. I have a prototype of what a YAML solution would look like; a rough sketch is below. Some of the discussion seems to assume building IDE-like capabilities into OpenRefine itself, but that seems like a heavy lift to me.
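For illustration (field names approximate, based on the current operation history format, and not the prototype itself), a single operation in today's JSON looks roughly like this:

```json
[
  {
    "op": "core/text-transform",
    "engineConfig": { "mode": "row-based", "facets": [] },
    "columnName": "name",
    "expression": "grel:value.trim()",
    "onError": "keep-original",
    "repeat": false,
    "repeatCount": 10,
    "description": "Text transform on cells in column name using expression grel:value.trim()"
  }
]
```

The same operation in YAML could read:

```yaml
# Same operation as above; the ability to write comments like this
# one is a big part of the appeal if people edit "macros" by hand.
- op: core/text-transform
  columnName: name
  expression: grel:value.trim()
  onError: keep-original
  repeat: false
  repeatCount: 10
  engineConfig:
    mode: row-based
    facets: []
```

Beyond being easier to scan, YAML supports comments, which matters if people are going to parameterize and annotate these workflows by hand.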

I think there are a few other useful reproducibility scenarios to consider besides the two mentioned:

0A - run a workflow on identical inputs with the same version of OpenRefine and get identical results (or a clear error message highlighting which step failed)
0B - run a workflow on identical inputs using a later version of OpenRefine and get identical results (or a clear error message)
3 - export a comprehensive, human-readable workflow description that covers ALL operations

I think people expect 0A & 0B to work today, but the system isn't built to support them (and we don't test for this). It wouldn't be a heavy lift to make the error handling/reporting more robust, but it's work that needs to be done. Adding a version identifier to saved workflows might help identify compatibility mismatches, and we probably also need to record which extensions a workflow uses and check for them when it's run again (a sketch of what that metadata might look like follows below). A decision should also be made about whether the operation DSL gets documented (and stabilized) for external use; people may want to use it as input to other processes/analyses, and I believe some already do today. Even just supporting this level 0 use case would deliver real benefit to users (and bring the system in line with what they already expect to be true).
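To make that concrete, here's one hypothetical shape that metadata could take (none of these fields exist today; the names are invented purely for illustration):

```yaml
# Hypothetical envelope around a saved workflow -- all field names
# here are invented; nothing like this exists in OpenRefine yet.
formatVersion: 1             # version of the workflow format itself
createdWith: OpenRefine 3.8  # engine version that produced the workflow
requiredExtensions:
  - id: wikibase             # checked up front, so a missing extension
    minVersion: "1.0"        #   fails fast with a clear message
operations:
  # ... the operation list as it exists today ...
```

Checking this up front is what turns a mysterious mid-workflow failure into the "clear error message" that 0A and 0B call for.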

#3 is a human-oriented reproducibility scenario: it gives data analysts and scientists documentation of what was done in an OpenRefine project, along the lines of the mock-up below.
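Since every operation already carries a "description" string, a first cut could be as simple as rendering those in order. Purely as a mock-up (no such export exists today, and the project details are invented):

```text
Project history: customers (exported 2024-03-01)

1. Created project from file customers.csv (1,204 rows)
2. Text transform on cells in column "name" using expression
   grel:value.trim()
3. Merged 37 values in column "city" via clustering (fingerprint)
4. Removed 12 rows matching facet "status" = "duplicate"
```

The catch, as the "ALL operations" requirement hints, is making sure every operation (including those from extensions) produces a meaningful description rather than silently dropping out of the narrative.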

I see command-line execution of workflows as a separate capability that brings a number of other requirements with it, so even if it's a "natural follow-up," I think it deserves to be considered separately. It probably implies a documented, supported, and versioned API, among other things. People may also bring assumptions about larger-scale data or other "production"-level capabilities that they associate with command-line use. Expanding scope later, rather than trying to do it all in one go, seems like the best approach to me.

Tom