Both options will improve OpenRefine, but developing the macro is more of a customization project than an enhancement for reproducibility.
The first scenario corresponds to what was proposed in the grant application. Many users have already hacked together this process on their own, and official support would be welcome.
Looking closer, we should carefully define the scope of this work and what we mean by "turning OpenRefine more into a pipeline runner".
-
If we are considering an official headless mode, I'd appreciate hearing from @felixlohmeier, given his extensive experience with openrefine-client and openrefine-batch. His approach is particularly advanced: it lets developers integrate OpenRefine into larger scripts that handle data retrieval and publishing.
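For readers less familiar with that pattern, here is a minimal sketch of what such an integration script could look like, assuming openrefine-client is installed and a local OpenRefine instance is running. The flag names are from memory and may differ between client versions, and the source URL is invented:

```python
# Sketch only: wrap openrefine-client in a larger retrieve -> clean -> publish script.
# Flag names are assumptions and may vary between versions of the client.
import subprocess
import urllib.request

SOURCE_URL = "https://example.org/raw-data.csv"  # hypothetical source
OPERATIONS = "operations.json"  # operation history exported from the OpenRefine UI


def run(args):
    """Run one pipeline step and fail loudly instead of silently."""
    print("+", " ".join(args))
    subprocess.run(args, check=True)


# 1. Retrieve the raw data outside of OpenRefine.
urllib.request.urlretrieve(SOURCE_URL, "raw-data.csv")

# 2. Create a project, apply the recorded operations, and export the result.
run(["openrefine-client", "--create", "raw-data.csv"])
run(["openrefine-client", "--apply", OPERATIONS, "raw-data"])
run(["openrefine-client", "--export", "--output", "clean-data.csv", "raw-data"])

# 3. Publish the cleaned file (repository upload, database load, ...).
```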
-
Are we considering orchestration capabilities, with the option to run on a schedule and send alerts on failure? If so, the workflow orchestration space is moving fast, and many great open-source solutions are already available. I would prefer that OpenRefine integrate nicely with them rather than recreate our own version.
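As one illustration of what "integrating" could mean, here is a rough sketch of delegating scheduling and failure alerts to an existing orchestrator. Apache Airflow is used only as an example, the DAG name, email address, and wrapper script `run_openrefine_pipeline.py` are invented, and parameter names vary between Airflow versions:

```python
# Sketch only: let an existing orchestrator own the schedule and the alerting,
# and simply call the OpenRefine pipeline as one task. Assumes Airflow 2.x.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="openrefine_cleaning",            # hypothetical DAG name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",                       # the orchestrator owns the schedule...
    catchup=False,
    default_args={
        "email_on_failure": True,            # ...and the alerting on failure
        "email": ["data-team@example.org"],  # hypothetical address
    },
) as dag:
    BashOperator(
        task_id="apply_operations",
        # hypothetical wrapper around the openrefine-client steps shown above
        bash_command="python run_openrefine_pipeline.py",
    )
```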
I like @tfmorris's approach of moving in that direction with smaller, more frequent releases. I would also include:
- Improving the template export so that it becomes part of the project history, as listed under the export template GitHub label
- Making it easy to refactor the JSON operation script. For example, when I rerun my operations on dataset B, I want to easily edit steps to prevent errors early in the execution. The refactoring can include changing a facet, editing a GREL expression, or adding a cleaning step to handle new scenarios (see the sketch after this list).
- Regarding the ease of editing steps, there are other points covered in User Interviews Results Part 2: Exploring Feedback Regarding OpenRefine Feature and User Experience
- Better error handling so operations don't fail silently
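To make the refactoring point concrete, here is a rough sketch of the kind of edits I mean, done by hand on an exported operation history before reapplying it to dataset B. The column names, facet, and GREL expressions are invented examples; the field names follow what OpenRefine exports for a text-transform operation, as far as I recall:

```python
# Sketch only: tweak an exported OpenRefine operation history before rerunning it
# on a new dataset. Field names mirror OpenRefine's exported JSON (op, columnName,
# expression, engineConfig, ...); the concrete values here are invented examples.
import json

with open("operations.json") as f:
    operations = json.load(f)

for op in operations:
    # Example 1: edit a GREL expression in an existing text-transform step.
    if op.get("op") == "core/text-transform" and op.get("columnName") == "title":
        op["expression"] = "grel:value.trim().toTitlecase()"

    # Example 2: drop a facet that only made sense for dataset A.
    if "engineConfig" in op:
        op["engineConfig"]["facets"] = [
            facet for facet in op["engineConfig"].get("facets", [])
            if facet.get("columnName") != "legacy_status"
        ]

# Example 3: add a cleaning step for a scenario that only appears in dataset B.
operations.append({
    "op": "core/text-transform",
    "engineConfig": {"facets": [], "mode": "row-based"},
    "columnName": "date",
    "expression": "grel:value.replace('/', '-')",
    "onError": "keep-original",
    "repeat": False,
    "repeatCount": 10,
    "description": "Normalise date separators found in dataset B",
})

with open("operations-dataset-b.json", "w") as f:
    json.dump(operations, f, indent=2)
```

Today this kind of editing has to happen in a text editor or a script like the one above; making it possible directly in the UI is what I have in mind.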