Work plan for the reproducibility improvements project

From afar, the new project to improve OpenRefine’s reproducibility might look like a fairly chaotic selection of issues with unclear links between them.

The common thread between all of them is that they require changes in the way OpenRefine represents the history of a project. It is worth tackling them together because making changes to the way we represent project history is costly:

  • it is generally difficult to make those changes while staying compatible with previous versions of OpenRefine and with existing extensions
  • it is likely to introduce data loss or corruption issues
  • it involves adaptations to a large number of functionalities (all the ways you can make changes to the grid)

Therefore I am aiming for a new history format which addresses as many of those UX issues as possible. Designing it upfront and implementing it in one go would be a headache, so I am trying to make this process as iterative as possible. Among the use cases listed in the project tracker so far, there are many dependencies and similarities. I have attempted to draw a dependency graph of those tasks, so as to understand better where to start:

The nodes with a thicker border represent internal changes in history representation, and the others are user-facing changes that are enabled by those internal changes. The dependencies are rather coarse-grained and debatable: they reflect my own interpretation of those user requirements and how I want to solve them.

Generally speaking, my goal is to have things testable as often as possible, so I plan to work on user-facing issues soon after they are enabled by an internal change. I plan to work on those changes in the 4.0 branch, because the serialization format for projects in that branch is still a work in progress, so our hands are not tied by compatibility with previous versions.

Let me know how that sounds and if you have any suggestions!

Hmm, I’ve been thinking as I was commenting on some issues you had in the project. What if we flipped things around BOLDLY? Go for schemaless append-only and track cell changes completely with a change id or ref key, as Uber did to version the cells for a given row key and column? It’s a rather ingenious system based on Google’s approaches in Bigtable. See “Designing Schemaless, Uber Engineering’s Scalable Datastore Using MySQL” on the Uber Blog.
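To make the idea a bit more concrete, here is a minimal sketch of append-only cell versioning keyed by row key and column. It is neither Uber's implementation nor anything that exists in OpenRefine: the class `AppendOnlyCellStore`, its `write` and `read` methods and the integer change ids are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional, Tuple


@dataclass
class AppendOnlyCellStore:
    """Hypothetical append-only store: cells are never overwritten, only versioned."""

    # (row_key, column) -> list of (change_id, value), in increasing change_id order
    _versions: Dict[Tuple[int, str], List[Tuple[int, Any]]] = field(default_factory=dict)
    _next_change_id: int = 1

    def write(self, row_key: int, column: str, value: Any) -> int:
        """Append a new version of a cell and return the change id assigned to it."""
        change_id = self._next_change_id
        self._next_change_id += 1
        self._versions.setdefault((row_key, column), []).append((change_id, value))
        return change_id

    def read(self, row_key: int, column: str, as_of: Optional[int] = None) -> Any:
        """Return the value written at or before change `as_of` (default: the latest)."""
        for change_id, value in reversed(self._versions.get((row_key, column), [])):
            if as_of is None or change_id <= as_of:
                return value
        return None  # the cell did not exist yet at that point in the history


store = AppendOnlyCellStore()
first = store.write(0, "city", "paris")   # original value
second = store.write(0, "city", "Paris")  # a later edit appends, it does not overwrite
```

With such a layout, undoing a change amounts to reading the grid as of an earlier change id, for example `store.read(0, "city", as_of=first)`, rather than rewriting any data.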

Here is a progress report on this front, for January 2023.

On the development side of things, I have started to tackle the first tasks suggested by the dependency graph above. Specifically, I have worked on:

  • Unified HTTP API to apply operations (#5539)
  • Warn users of history entry loss (#3184)
  • Row/record preservation metadata in the backend (#5561)
  • Preservation of pagination for the undo functionality (#572)

My goal with this was to strike a balance between internal refactoring and user-facing features. Even if the latter are rather low-hanging fruit and not necessarily very representative of the benefits the project will bring in the long term, it feels important to me to have something to show regularly. Those user-facing changes were also a good opportunity to get insightful feedback from other contributors, for instance with suggestions of support for branching project histories, which could totally be in scope for this project. I cannot guarantee that I will be able to retain this way of working at all times: there will definitely be phases where I need to work longer on some internal architecture, without user-facing changes being immediately available. I also acknowledge @tfmorris’ encouragement not to make piecemeal changes and to aim for the end state more directly.

My next task is the support for partial change data serialization. Broadly speaking, this is about making it possible for OpenRefine to store intermediate results of long-running operations such as reconciliation, so that we can resume an interrupted reconciliation operation (#87) or show the first few reconciled rows before the entire project data is processed (#5541). Those are features where I will need more design input from the community, so I will open specific forum threads for that.
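As a rough illustration of what storing intermediate results could mean, here is a minimal sketch in Python (not OpenRefine code, and not a proposal for the actual format). It assumes a hypothetical JSON-lines file where each processed row is appended as soon as its result is known, so that a later run can skip the rows already done. The names `load_partial`, `run_resumable`, `fake_reconcile` and the `max_new` parameter (used here to simulate an interruption) are all made up for the example.

```python
import json
import os
import tempfile
from typing import Callable, Dict, List, Optional


def load_partial(path: str) -> Dict[int, str]:
    """Read back the results that were already serialized, if any."""
    results: Dict[int, str] = {}
    if os.path.exists(path):
        with open(path, encoding="utf-8") as f:
            for line in f:
                entry = json.loads(line)
                results[entry["row"]] = entry["value"]
    return results


def run_resumable(rows: List[str], process: Callable[[str], str], path: str,
                  max_new: Optional[int] = None) -> Dict[int, str]:
    """Process the rows not yet present in the file, appending each result as it comes."""
    results = load_partial(path)
    processed = 0
    with open(path, "a", encoding="utf-8") as f:
        for index, cell in enumerate(rows):
            if index in results:
                continue  # already computed in a previous run
            if max_new is not None and processed >= max_new:
                break  # simulated interruption
            value = process(cell)
            f.write(json.dumps({"row": index, "value": value}) + "\n")
            f.flush()  # make the partial result durable before moving on
            results[index] = value
            processed += 1
    return results


calls: List[str] = []


def fake_reconcile(cell: str) -> str:
    """Stand-in for a slow call to a reconciliation service."""
    calls.append(cell)
    return cell.title()


rows = ["paris", "berlin", "rome"]
path = os.path.join(tempfile.mkdtemp(), "change_data.jsonl")
partial = run_resumable(rows, fake_reconcile, path, max_new=2)  # interrupted after 2 rows
complete = run_resumable(rows, fake_reconcile, path)            # resumes with the third row
```

The partial result after the first run is what a UI could already display, and the second run only calls the service for the rows that are still missing.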

I have also been working on documentation. I want to write a good architectural overview of the new architecture and a migration guide for extension developers, but before I get to that, it felt necessary to do a big cleanup of the existing technical documentation. This is long overdue, since the technical reference was still a mostly disorganized copy of some wiki pages, with fairly outdated content. Given that we will likely have prospective interns coming to contribute, it also feels timely to get our house in order before that. I have also worked on improving the Javadoc for the new architecture. The goal is to publish Javadocs that we can be proud of (#2311) and that serve as a complement to the high-level architectural overview in the technical reference.

Let me know if you wish for different priorities or ways of working. And get ready for more exciting user-facing changes, which should hopefully appear soon for your review :slight_smile: