Reproducibility project: new approach

Hi all,

I'd like to propose a new approach for the reproducibility project: implementing the planned improvements as lightweight changes on top of the existing architecture.

I would do this in three broad steps, each of which will require several pull requests of its own:

  1. Add metadata to existing operations, in a backwards compatible way, so that they expose their expectations of the table structure (input columns, overlay models required) as well as the impact they are expected to have on the grid (columns created or deleted, preservation of record boundaries). This metadata is speculative, in the sense that the architecture does not validate that the operation fully respects it, but it should be sufficient to enable the following user-facing improvements.
  2. Implement a better visualization of the operation history, following the existing designs. The metadata exposed earlier can be used to show the operation dependencies, and the operation icons can help users understand the structure of the history in a more visual way.
  3. Implement better error reporting and validation when applying recipes, also based on the metadata introduced earlier.
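
To make steps 1 and 3 a bit more concrete, here is a minimal sketch of what such metadata and an ahead-of-time recipe check could look like. All names in it (`ColumnMetadata`, `RecipeValidator`, `firstProblem`) are hypothetical and do not exist in the current code base; overlay models and record boundaries are left out for brevity.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Optional;
import java.util.Set;

/** Hypothetical metadata announced by an operation (not an existing OpenRefine class). */
final class ColumnMetadata {
    final String operationName;
    final Set<String> requiredColumns; // columns that must exist before the operation runs
    final Set<String> createdColumns;  // columns the operation announces it will add
    final Set<String> deletedColumns;  // columns the operation announces it will remove

    ColumnMetadata(String operationName, Set<String> requiredColumns,
                   Set<String> createdColumns, Set<String> deletedColumns) {
        this.operationName = operationName;
        this.requiredColumns = requiredColumns;
        this.createdColumns = createdColumns;
        this.deletedColumns = deletedColumns;
    }
}

/** Hypothetical check of a whole recipe before any operation is applied. */
final class RecipeValidator {

    /**
     * Walks the recipe in order, tracking which columns are expected to exist,
     * and describes the first step whose required columns would be missing.
     */
    static Optional<String> firstProblem(Set<String> projectColumns, List<ColumnMetadata> recipe) {
        Set<String> columns = new LinkedHashSet<>(projectColumns);
        for (int i = 0; i < recipe.size(); i++) {
            ColumnMetadata op = recipe.get(i);
            for (String required : op.requiredColumns) {
                if (!columns.contains(required)) {
                    return Optional.of("Step " + (i + 1) + " (" + op.operationName
                            + ") needs column '" + required + "', which would not exist at that point");
                }
            }
            // Trust the announced impact to predict the columns seen by the next step.
            columns.removeAll(op.deletedColumns);
            columns.addAll(op.createdColumns);
        }
        return Optional.empty();
    }
}
```

The check relies entirely on what each operation announces, which is exactly why the metadata is speculative: if an operation announces the wrong columns, the prediction for the later steps is wrong too.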

It's not going to be as robust as what I was planning to implement with the new history architecture, because with this approach there is no guarantee that operations actually do what they announce, but I think it's still going to bring useful improvements to users. After all, there is currently also no guarantee that undoing an operation brings you back to the state you had before applying it, but the bugs this causes are rather rare. This approach also eases the pressure by removing the dependency on breaking changes, making it easier to implement as a series of small changes on top of our existing architecture. The more ambitious implementation I worked on last year could still be used in the future, when we have a clearer idea of how we want to schedule it with respect to other breaking changes and improvements to the extensions system.

What do you think?

Add metadata to existing operations, in a backwards compatible way, so that they expose their expectations on the table structure (input columns, overlay models required) as well as the expected impact they have on the grid (columns created or deleted, preservation of record boundaries). This metadata is speculative, in the sense that the architecture does not validate that the operation fully respects it, but it enables better error reporting ahead of applying a series of operations and should be sufficient to enable the following user-facing improvements

I'm very supportive of this idea, especially if it could be done in a way which would be utilized/extended by extensions. The major use case on my end would be to connect users with operations so that one can see who did what.

Those sound like useful improvements. I think better error checking/handling/reporting is the most valuable piece if you want to promote the operation history as a generally usable recipe function. My, perhaps naive, impression is that using one of the unsupported clients is a relatively popular way to apply recipes. If that's actually the case, I'd strongly consider adding versioning to the protocol and making it more robust as an important prerequisite. Lastly, JSON is, in my opinion, an unacceptable base upon which to build a DSL that's intended to be human editable. I would suggest YAML would be a better starting point. I can resurrect my PoC to show that.

I recognize that some or all of this may not fit into the available funding and/or the way the grant was written, but it does address the general topic of "reproducibility."

Tom

Perspective, and Reminder of what's really important to our users:

Client library usage is extremely unpopular.
Client library usage was only 3% of all plugin usage among the 207 respondents in the 2022 survey, which is roughly 6 people out of the 207 surveyed.

Reproducibility, concerns about improving, modifying, or replaying history, and general workflow feature improvements were only asked for by 1 respondent, which was me, back in 2010, when I asked David if we could support undoing a single change in the history (Selective undo · Issue #183 · OpenRefine/OpenRefine on github.com).

I've never seen data, surveys, or other people's names asking for improvements to how Undo/Redo currently works (other than adding a warning) or for better support for repeatable workflows.

What users wanted was a warning so that they would not shoot themselves in the foot and lose forward history unintentionally. We gave them that in "Add dialog to warn of history entry deletion" by wetneb (Pull Request #6659 · OpenRefine/OpenRefine on GitHub).

Generally, as evidenced in the survey, folks continue to want:

  1. Large data support
  2. Easier UX and QOL improvements (Drag and Drop column reordering, Append Rows, Guides, Easier GREL or Menus for often used complex RegEx patterns, etc.)

Regarding your new approach, @antonin_d,

I have but one question: why would the metadata be speculative, and why would the architecture not know for sure whether an operation respects "it"? What is "it" in that sense, the metadata? That seems oddly worded to me, because I would expect that for each operation we would have metadata showing all of the operation's dependencies (GREL, operations, columns, facet selections). But you are saying that would not be the case, and that some of the dependencies would be guessed? I'd love more low-level detail here on why the "operations would not do what they announce."

But overall, I think the narrower planned improvement steps are a great step forward in progress in general. The unused work is never a waste, instead, it is all learning and thus beneficial.

I'm happy to provide more details for sure!
To make changes to an OpenRefine project we use two sorts of Java classes, the Operations and the Changes.
The Operations represent the high-level descriptions of what sort of transformation was run on the grid, whereas the Changes hold the data that changed between the two states of the grid and take care of applying / reverting the change by moving that data to the appropriate location in the grid.
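
As a rough illustration of that split, here is a toy model. It is deliberately not the real OpenRefine API: `ToyGrid`, `ToyChange`, `ToyOperation` and `RemoveColumnOperation` are made-up names, and the grid is reduced to a list of column names.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy grid: only a list of column names; a real project holds far more state. */
final class ToyGrid {
    final List<String> columns = new ArrayList<>();
}

/** Low-level step: holds the data needed to move the grid forward and back. */
interface ToyChange {
    void apply(ToyGrid grid);
    void revert(ToyGrid grid);
}

/** High-level description of a transformation; it produces the change that carries it out. */
interface ToyOperation {
    String describe();
    ToyChange createChange(ToyGrid grid);
}

final class RemoveColumnOperation implements ToyOperation {
    private final String columnName;

    RemoveColumnOperation(String columnName) {
        this.columnName = columnName;
    }

    @Override
    public String describe() {
        return "Remove column " + columnName;
    }

    @Override
    public ToyChange createChange(ToyGrid grid) {
        // The change remembers where the column was, so that revert can restore it.
        final int index = grid.columns.indexOf(columnName);
        if (index < 0) {
            throw new IllegalArgumentException("No such column: " + columnName);
        }
        return new ToyChange() {
            @Override
            public void apply(ToyGrid g) {
                g.columns.remove(index);
            }

            @Override
            public void revert(ToyGrid g) {
                g.columns.add(index, columnName);
            }
        };
    }
}
```

Nothing in the `ToyChange` interface restricts what `apply` or `revert` may do to the grid, which mirrors the point that a change can touch more than its operation describes.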

The approach I am proposing here is to add the required metadata to the operations, so that we are able to reliably determine which columns they depend on and modify. Even if Operation classes indeed expose this metadata, the underlying Change class will still be able to make arbitrary changes to the grid. This means that, if there is a bug in the Operation or Change code, it is possible that the Operation announces a columnar scope which does not match what the operation actually does. Such bugs would be similar to some issues we have had in the past, with undoing an operation not being exactly the reverse of applying it, or having other sorts of weird side effects.

Of course, my goal is to implement this without introducing such bugs. The idea I was trying to convey above with the word "speculative" is that the architecture of our code will not support developers in ensuring that the announced column dependencies match the behavior of the operations. In particular, given that extensions can define new operations, it is also possible that such bugs are introduced by them.
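
To show what a mismatch between announced and actual behavior means in practice, here is a small sketch of a comparison one could run in tests. `AnnouncedScopeCheck` and `matches` are hypothetical names, and the check only looks at column names, not at cell contents or record boundaries, so it can only catch some of these bugs.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical helper comparing announced column changes with observed ones. */
final class AnnouncedScopeCheck {

    /**
     * Returns true when the columns that actually appeared and disappeared
     * between two grid states are exactly the ones the operation announced.
     */
    static boolean matches(Set<String> columnsBefore, Set<String> columnsAfter,
                           Set<String> announcedCreated, Set<String> announcedDeleted) {
        // Columns present afterwards but not before were created.
        Set<String> actuallyCreated = new HashSet<>(columnsAfter);
        actuallyCreated.removeAll(columnsBefore);

        // Columns present before but not afterwards were deleted.
        Set<String> actuallyDeleted = new HashSet<>(columnsBefore);
        actuallyDeleted.removeAll(columnsAfter);

        return actuallyCreated.equals(announcedCreated)
                && actuallyDeleted.equals(announcedDeleted);
    }
}
```

A check like this would have to be written and run by whoever develops the operation, including extension authors; the architecture itself would not enforce it.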

Is that any clearer?
