Operation history research

Hi everyone!

As we move along in the design process for Operation History within the tool, I wanted to a) give an update and b) request input.

At the moment I’m collecting research on how various tools design their Operation History functions, and I’m casting a wide net. I’ve been taking notes on the user journeys/interfaces of spreadsheet tools (Excel, Google Sheets) as well as creative editing tools that are quite different: photo editing tools (e.g. Photoshop) and advanced writing tools (e.g. Scrivener, a favorite tool of many journalists I speak to). The more types of workflows I look at, the more design ideas I’m able to bring to the table.

As I go through each tool, I’ve been asking: what assumptions, needs, and capabilities is its History functionality meant to address? How can broadening our perspective help us ask better questions, so that we better address the needs of experienced OpenRefine users as well as the less experienced, and perhaps less technical, users we’d like to attract?

Our first user research conversations will take place with experienced OpenRefine users, and we will go from there.

I’d love more suggestions for tools to look at.

Here’s my list so far:

Spreadsheet:
Google Sheets
Excel
LibreOffice Calc

Data Viz:
Talend
Tableau

Image:
Photoshop
GIMP

Writing:
Scrivener
Word

Right. So it might also be helpful to think of OpenRefine's operation history as a recording of workflow steps, or alternatively as an "event history". Both terms are also useful when searching for similar software that records events, lets users browse and filter those events, etc.

Some tools incorporate workflows (typically in business, factory automation, and science), where processes are described (typically in BPMN) and jobs then run those processes (ETL-like software). Some have a concept of revert, also called rollback, which is exactly the concept OpenRefine is based on: no real work is performed to undo; instead it just goes back to the data index and re-reads its state directly from disk. I cannot recall any that can revert only some steps, because they usually operate on external systems like databases rather than on local memory or disk. Many, however, start from a different premise than OpenRefine does: immutable data.
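
To make the rollback idea concrete, here is a toy sketch in Python. It is only an illustration of "go back to an index and re-read the saved state" as described above, not OpenRefine's actual implementation: the `SnapshotHistory` class, its method names, and the in-memory list standing in for on-disk state are all assumptions made for the example.

```python
from copy import deepcopy


class SnapshotHistory:
    """Toy history that keeps one saved copy of the data per step.

    Hypothetical illustration only: a plain list stands in for state on disk.
    """

    def __init__(self, initial_rows):
        self._snapshots = [deepcopy(initial_rows)]
        self._labels = ["initial state"]
        self._position = 0

    def apply(self, label, operation):
        """Run an operation on the current state and record the result."""
        # Entries after the current position are dropped before recording.
        del self._snapshots[self._position + 1:]
        del self._labels[self._position + 1:]
        current = deepcopy(self._snapshots[self._position])
        self._snapshots.append(operation(current))
        self._labels.append(label)
        self._position += 1

    def rollback_to(self, index):
        """Move to an earlier entry; nothing is recomputed or inverted."""
        if not 0 <= index < len(self._snapshots):
            raise IndexError(f"no history entry at index {index}")
        self._position = index

    @property
    def rows(self):
        """Return a copy of the state saved at the current position."""
        return deepcopy(self._snapshots[self._position])


history = SnapshotHistory([{"name": " Ada "}, {"name": "Bob"}])
history.apply("trim", lambda rows: [{"name": r["name"].strip()} for r in rows])
history.apply("upper", lambda rows: [{"name": r["name"].upper()} for r in rows])
history.rollback_to(1)  # back to the state right after "trim"
```

Note that rolling back here only moves a pointer. Reverting a single step from the middle of the history would need something more, such as replaying the later steps, which is where it gets hard.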

Here are a few that I am aware of, some of which I have used as well.

Apache NiFi - its data provenance feature is event based, but no undo
Airflow - useful orchestrator, but no undo directly.
Camunda (process orchestration) - not event based, no undo

Having said all that, I think the real question is this: if reproducibility is also a concern, then we should not look only at interfaces for viewing/filtering history, but also at those that show the scaffolding/building of a workflow that is yet to be performed and has no operation history yet. In a way, OpenRefine should surely have both.

I see that the research results have been posted, so perhaps this is too late, but for what it's worth:

At the moment I’m collecting research on how various tools design their Operation History functions, and I’m casting a wide net. [...] The more types of workflows I look out, the more design ideas I’m able to bring to the table.

There's a vast gulf between "operation history" and "workflow," so it's important to be clear about which you are focused on. Although it's not supported today, there's been a strong push to extend OpenRefine to support full-fledged reusable workflows (and some are already marketing it as having that capability). Obviously that's a much more complicated topic than a simple undo/redo capability like Google Sheets/Docs.

If you decide that you want to include workflow languages, some things to consider include:

Bioinformatics workflow languages:

  • Common Workflow Language (CWL)
  • Nextflow
  • WDL
  • Snakemake

Graphical Languages / UIs

  • Galaxy (bioinformatics)
  • KNIME (machine learning)

General Workflow Languages

  • SQL (used as transformation language of data pipelines)
  • Apache Airflow DAGs
  • dozens of academic implementations (workflow languages are a very popular thing to invent and write a paper about)

Cloud-based:

I was head of product for an open source software company that built and supported workflow management systems for bioinformatics, so I know a lot about that space and am happy to discuss it in depth. It has quite a different design focus than OpenRefine does, although there are common crossover topics such as the desire for strong provenance, binary vs source reproducibility, etc.

Tom

Hi Tom,

There's a vast gulf between "operation history" and "workflow," so it's important to be clear about which you are focused on. Although it's not supported today, there's been a strong push to extend OpenRefine to support full-fledged reusable workflows (and some are already marketing it as having that capability). Obviously that's a much more complicated topic than a simple undo/redo capability like Google Sheets/Docs.

I think that when @zoecooper wrote "The more types of workflows I look out", she used the term "workflow" in a broad sense: a series of steps a user takes to do something with a tool. You might have understood it in a more specific sense (a data cleaning pipeline composed from various operations combined together). For instance, when I do my laundry, I use a certain workflow: it's a legitimate use of that term too, I think.

I know you oppose using the term "workflow" for "a series of OpenRefine operations" because the current functionality of the tool does not provide the reproducibility guarantees that one would expect of a tool that lets users design data cleaning workflows. The point of the reproducibility project is precisely to reduce this gap, with the knowledge that it will not go away entirely. That does not mean that we have to use the term "workflow" to talk about a series of OpenRefine operations after those improvements, as there are many other candidates such as "recipe", "macro", "pipeline"… Examining the language that other tools use was also a goal of this desk research.

I think the other tools you mentioned could be worth exploring as well, but they are aimed at a significantly more tech-savvy audience. Personally, I am in favor of focusing on more approachable tools that our existing user base is more familiar with. The ambition of this project is to bring reproducibility to the current OpenRefine users, who appreciate it in large part for its "no-code" or "low-code" aspects. As such, I think there isn't much room to imitate Apache Airflow or even KNIME.

To summarize: for me the goal of this project is to keep OpenRefine as a tool where you do data cleaning interactively, in a point-and-click way, such that it looks approachable to people coming from Excel. My hope is that after having done those interactive data cleaning steps, the undo/redo history can be interacted with in a way that's a bit more like a pipeline. A pipeline that users would not have designed consciously, but that they would be able to understand and relate to, giving them more of the benefits of programming without requiring them to put themselves in the mindset of a programmer.

Thanks for the additional details. That's very helpful. Just to be clear, I don't have any problem with developing an OpenRefine workflow language/processor or even using the word "workflow." I just object to people overselling what is there and encouraging users to put their valuable data at risk. Small-scale interactive use of the operation history on data sets of exactly the same shape is probably (mostly) fine, but folks who are running things using alternate clients or doing large-scale unattended processing are putting their data at risk of silent data corruption, which is kind of the antithesis of reproducibility. I certainly would never let a client use it in production.

Some of the things that people have asked for in the past include programmatic running of scripts, running scripts across large numbers of files, and parameterization of scripts to handle different column names/order and other variability. Are these all out of bounds? What about more advanced topics like composability (subworkflows), comparing versions/variants of scripts, etc.?
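
As a rough sketch of what parameterization could mean, here is a small Python example. Everything in it is hypothetical: the step format, the `$name_col` style placeholders, and the `bind`/`replay` helpers are invented for illustration and do not correspond to OpenRefine's operation JSON or to any existing API. The point is only that column names get bound per file, and that a replay refuses to run when the data does not have the expected shape instead of quietly doing the wrong thing.

```python
from string import Template

# Hypothetical recorded steps; "$name_col" and "$code_col" are placeholders
# that get bound to real column names for each input file.
RECORDED_STEPS = [
    {"op": "trim", "column": "$name_col"},
    {"op": "uppercase", "column": "$code_col"},
]

# Hypothetical operation table mapping step names to cell-level functions.
OPERATIONS = {
    "trim": str.strip,
    "uppercase": str.upper,
}


def bind(steps, params):
    """Fill in the column placeholders; a missing parameter raises KeyError."""
    return [
        {"op": step["op"], "column": Template(step["column"]).substitute(params)}
        for step in steps
    ]


def replay(rows, steps):
    """Apply bound steps to copies of the rows, checking the shape first."""
    columns = set(rows[0]) if rows else set()
    for step in steps:
        if step["column"] not in columns:
            raise KeyError(
                f"step {step['op']!r} expects missing column {step['column']!r}"
            )
    result = [dict(row) for row in rows]
    for step in steps:
        transform = OPERATIONS[step["op"]]
        for row in result:
            row[step["column"]] = transform(row[step["column"]])
    return result


steps = bind(RECORDED_STEPS, {"name_col": "Full Name", "code_col": "ISO"})
cleaned = replay([{"Full Name": " ada ", "ISO": "gb"}], steps)
```

The same `RECORDED_STEPS` could then be bound with different column names for another file, and a file lacking one of those columns fails loudly before any row is touched.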

Zoe - Another category of tools came to mind, which I should have thought of before, namely keyboard macros as available in Microsoft Word/Excel, emacs, and a variety of other editors. Sometimes these are actually recorded using a programming language like Lisp (emacs) or Visual Basic (MS products), which can then be edited and reused, but even simple keyboard macros can typically be saved in a library, given a name, etc.

Tom