Reproducibility project: new approach

antonin_d · July 29, 2024, 9:22am

Hi all,

I'd like to propose a new approach for the reproducibility project: implementing the planned improvements as lightweight changes on top of the existing architecture.

I would do this in three broad steps (all of which will require a bunch of pull requests on their own):

Add metadata to existing operations, in a backwards compatible way, so that they expose their expectations on the table structure (input columns, overlay models required) as well as the expected impact they have on the grid (columns created or deleted, preservation of record boundaries). This metadata is speculative, in the sense that the architecture does not validate that the operation fully respects it, but it should be sufficient to enable the following user-facing improvements.
Implement a better visualization of the operation history, following the existing designs. The metadata exposed earlier can be used to show the operation dependencies, and the operation icons can help users understand the structure of the history in a more visual way.
Implement better error-reporting and validation to apply recipes, also based on the metadata introduced earlier.
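
To make the first and third steps more concrete, here is a minimal sketch of what such metadata and a recipe check could look like. Everything in it is hypothetical: the class `OperationMetadata`, its fields and the `validate` method are names made up for illustration, not existing OpenRefine APIs, and overlay models and record boundaries are left out for brevity.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class RecipeValidationSketch {

    /** Hypothetical metadata an operation could declare about itself. */
    public static final class OperationMetadata {
        public final String description;
        public final Set<String> requiredColumns;
        public final Set<String> createdColumns;
        public final Set<String> deletedColumns;

        public OperationMetadata(String description, Set<String> requiredColumns,
                Set<String> createdColumns, Set<String> deletedColumns) {
            this.description = description;
            this.requiredColumns = requiredColumns;
            this.createdColumns = createdColumns;
            this.deletedColumns = deletedColumns;
        }
    }

    /**
     * Walks the recipe in order, tracking which columns are expected to exist
     * after each step, and collects an error for every declared requirement
     * that is not met. It relies purely on the declared metadata: nothing here
     * checks what the operations actually do to the grid.
     */
    public static List<String> validate(Set<String> initialColumns, List<OperationMetadata> recipe) {
        Set<String> columns = new LinkedHashSet<>(initialColumns);
        List<String> errors = new ArrayList<>();
        for (int i = 0; i < recipe.size(); i++) {
            OperationMetadata op = recipe.get(i);
            String step = "Step " + (i + 1) + " (" + op.description + ")";
            for (String required : op.requiredColumns) {
                if (!columns.contains(required)) {
                    errors.add(step + ": required column '" + required + "' does not exist");
                }
            }
            for (String created : op.createdColumns) {
                if (columns.contains(created)) {
                    errors.add(step + ": column '" + created + "' already exists");
                }
            }
            // Update the expected column set according to the declaration.
            columns.removeAll(op.deletedColumns);
            columns.addAll(op.createdColumns);
        }
        return errors;
    }

    public static void main(String[] args) {
        List<OperationMetadata> recipe = List.of(
                new OperationMetadata("split name", Set.of("name"),
                        Set.of("first name", "last name"), Set.of()),
                new OperationMetadata("remove city", Set.of("city"), Set.of(), Set.of("city")),
                // This step depends on a column that the previous step removed.
                new OperationMetadata("transform city", Set.of("city"), Set.of(), Set.of()));
        for (String error : validate(Set.of("name", "city"), recipe)) {
            System.out.println(error);
        }
    }
}
```

The point of the sketch is that problems in a recipe can be reported before any operation runs, step by step, instead of failing halfway through.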

It's not going to be as robust as what I was planning to implement with the new history architecture, because with this approach there is no guarantee that operations indeed do what they announce, but I think that it's still going to bring useful improvements to users. After all, there is currently also no guarantee that undoing an operation indeed brings you back to the state you had before applying it, but the bugs this causes are rather rare. It also eases the pressure by removing the dependency on breaking changes, making it easier to implement this as a series of small changes on top of our existing architecture. The more ambitious implementation I worked on last year could still be used in the future, when we have a clearer idea of how we want to schedule its implementation with respect to other breaking changes and improvements to the extensions system.

What do you think?

abbe98 · July 29, 2024, 12:16pm

Add metadata to existing operations, in a backwards compatible way, so that they expose their expectations on the table structure (input columns, overlay models required) as well as the expected impact they have on the grid (columns created or deleted, preservation of record boundaries). This metadata is speculative, in the sense that the architecture does not validate that the operation fully respects it, but it enables better error reporting ahead of applying a series of operations and should be sufficient to enable the following user-facing improvements

I'm very supportive of this idea, especially if it could be done in a way which would be utilized/extended by extensions. The major use case on my end would be to connect users with operations so that one can see who did what.

tfmorris · July 29, 2024, 6:14pm

Those sound like useful improvements. I think better error checking/handling/reporting is the most valuable piece if you want to promote the operation history as a generally usable recipe function. My, perhaps naive, impression is that using one of the unsupported clients is a relatively popular way to apply recipes. If that's actually the case, I'd strongly consider adding versioning to the protocol and making it more robust as an important prerequisite. Lastly, JSON is, in my opinion, an unacceptable base upon which to build a DSL that's intended to be human editable. I would suggest YAML would be a better starting point. I can resurrect my PoC to show that.

I recognize that some or all of this may not fit into the available funding and/or the way the grant was written, but it addresses the general topic of "reproducibility."

Tom

thadguidry · July 30, 2024, 1:18am

Perspective, and Reminder of what's really important to our users:

Client library usage is extremely unpopular.
Client library usage was only 3% of all plugin usage from all 207 respondents in the 2022 survey, basically less than 6 folks out of 207 surveyed.

Reproducibility or concerns of improving, modifying, or replaying history, or general workflow feature improvements was only asked for by 1 respondent, which was me, back in 2010 when I asked David if we could support undoing a single change in the history Selective undo · Issue #183 · OpenRefine/OpenRefine (github.com).

I've never seen data, surveys, or other folks' names asking for improvements in how Undo/Redo currently works (other than adding a warning) or for better support for repeatable workflows.

What users wanted was a warning to not shoot themselves in the foot and lose forward history unintentionally. We gave them that in Add dialog to warn of history entry deletion by wetneb · Pull Request #6659 · OpenRefine/OpenRefine · GitHub

Generally, as evidenced in the survey, folks continue to want

Large data support
Easier UX and QOL improvements (Drag and Drop column reordering, Append Rows, Guides, Easier GREL or Menus for often used complex RegEx patterns, etc.)

thadguidry · August 3, 2024, 11:06pm

Regarding your new approach ideas, @antonin_d,

I have but one question: why would the metadata be speculative, and why would the architecture not know for sure whether an operation respects "it"? What is "it" in that sense, the metadata? That seems oddly worded, because I would expect that for each operation we would have metadata showing all of its dependencies (GREL, operations, columns, facet selections). But you are saying that would not be the case, and some of the dependencies would be guessed? I'd love more lower level detail here on why the "operations would not do what they announce."

But overall, I think the narrower planned improvement steps are a great step forward in progress in general. The unused work is never a waste, instead, it is all learning and thus beneficial.

antonin_d · August 27, 2024, 1:07pm

I'm happy to provide more details for sure!
To make changes to an OpenRefine project we use two sorts of Java classes, the Operations and the Changes.
The Operations represent the high-level descriptions of what sort of transformation was run on the grid, whereas the Changes hold the data that changed between the two states of the grid and take care of applying / reverting the change by moving that data to the appropriate location in the grid.

The approach I am proposing here is to add the required metadata to the operations, so that we are able to reliably determine which columns they depend on and modify. Even if Operation classes indeed expose this metadata, the underlying Change class will still be able to make arbitrary changes to the grid. This means that, if there is a bug in the Operation or Change code, it is possible that the Operation announces a columnar scope which does not match what the operation actually does. Such bugs would be similar to some issues we have had in the past, with undoing an operation not being exactly the reverse of applying it, or having other sorts of weird side effects.

Of course, my goal is to implement this without introducing such bugs. The idea I was trying to convey above with the word "speculative" is that the architecture of our code will not support developers in ensuring that the announced column dependencies match the behavior of the operations. In particular, given that extensions can define new operations, it is also possible that such bugs are introduced by them.
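
To illustrate the gap, here is a rough sketch of the kind of check that the architecture will not perform for us, but that a test could do by hand: comparing the columns an operation announces with the columns that actually differ before and after its change is applied. The class and method names are made up for this example and are not existing OpenRefine code. It also only looks at column names, so it would not notice a change that silently modifies cells in a column outside the announced scope.

```java
import java.util.Set;
import java.util.TreeSet;

final class DeclaredScopeCheckSketch {

    /** Elements of the first set that are absent from the second one. */
    static Set<String> difference(Set<String> first, Set<String> second) {
        Set<String> result = new TreeSet<>(first);
        result.removeAll(second);
        return result;
    }

    /**
     * Returns true when the columns observed before and after applying a change
     * are consistent with what the operation declared it would create and delete.
     */
    public static boolean matchesDeclaration(Set<String> columnsBefore, Set<String> columnsAfter,
            Set<String> declaredCreated, Set<String> declaredDeleted) {
        Set<String> actuallyCreated = difference(columnsAfter, columnsBefore);
        Set<String> actuallyDeleted = difference(columnsBefore, columnsAfter);
        return actuallyCreated.equals(declaredCreated) && actuallyDeleted.equals(declaredDeleted);
    }

    public static void main(String[] args) {
        Set<String> before = Set.of("name", "city");

        // The operation announces that it only creates "initials"...
        Set<String> declaredCreated = Set.of("initials");
        Set<String> declaredDeleted = Set.of();

        // ...but a hypothetical buggy change also drops "city".
        Set<String> afterBuggyChange = Set.of("name", "initials");

        System.out.println(matchesDeclaration(before, afterBuggyChange, declaredCreated, declaredDeleted));
    }
}
```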

Is that any clearer?
