Reproducibility project: September report

Here is an update on the progress of the reproducibility project in September.

Most of my work on this project last month went towards restructuring the 4.0 branch to make it easier to review. This meant, concretely:

  • reordering commits so that changes on the same topic are adjacent
  • squashing fixes into the commit which introduced the bug being fixed
  • fixing some compilation or test issues earlier to preserve a clean build in more stages of the refactoring
  • removing some spurious merge commits (between local and remote versions of the 4.0 branch)
  • fusing together some merge commits from master into 4.0 when they are close enough

This was done with a succession of rebases, for which I developed some workflows to avoid having to solve the same conflicts over and over. While the process was quite tedious at first, I have become more efficient at it and now feel freer to do more restructuring work in the future if needed.
For now I have mostly concentrated my efforts on the earlier steps of the refactoring, which I re-organized into chunks that I have presented as pull requests on my fork. These correspond to about half of the commit log of the 4.0 branch and I plan to open more of those pull requests (in the main repository) if the format suits @tfmorris and @abbe98 who have expressed interest in reviewing.

But beyond this restructuring work, I also got to make progress on the next steps of the project: the introduction of the columnar dependency metadata for operations. This is exciting, because it's finally touching on features which are at the core of the reproducibility topic. Here is a quick reminder of what this is about. The goal is that OpenRefine is able to record, for each change made to the project grid, which columns were affected by the change, and which columns it depended on.

For instance, when running a GREL transform with expression value + " " + cells["Last_name"].value on a column called First_name, the only column being modified is that column First_name and the operation depends on both the First_name and Last_name columns. If any facets were active during the transformation, the columns they depend on are also dependencies for the change.
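
To make the example above concrete, here is a minimal sketch of what such a record could look like. It is written in Python for brevity and is not the format used in OpenRefine: the names `ColumnarMetadata` and `transform_metadata` are hypothetical, and the columns read by the expression are assumed to be supplied by the caller rather than extracted from the GREL expression.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnarMetadata:
    """Hypothetical record of the columns one change writes and reads."""
    modified: frozenset    # columns written by the change
    depends_on: frozenset  # columns read by the change, facets included


def transform_metadata(target_column, expression_columns, facet_columns=()):
    # A transform only writes to its target column; it reads every column
    # referenced by the expression plus those used by the active facets.
    return ColumnarMetadata(
        modified=frozenset({target_column}),
        depends_on=frozenset(expression_columns) | frozenset(facet_columns),
    )


# The transform from the example: `value` reads First_name itself,
# and cells["Last_name"] reads Last_name.
meta = transform_metadata("First_name", {"First_name", "Last_name"})

# Same transform with a (hypothetical) facet active on a Country column.
faceted = transform_metadata(
    "First_name", {"First_name", "Last_name"}, facet_columns={"Country"}
)
```

Here `meta.modified` contains only `First_name`, while `meta.depends_on` contains both `First_name` and `Last_name`; with the facet active, `Country` is added to the dependencies.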

For now, those dependencies are not recorded at all in OpenRefine. By recording them, the goal is to enable the following features:

  • make it possible to run operations concurrently if they touch distinct parts of the grid. For instance, reconciling two columns at the same time.
  • detect the requirements for a series of operations (currently represented as JSON blobs) before re-applying them. At the moment, if the operations rely on columns which do not exist in the grid they are re-applied to, the execution can fail in unpredictable ways. Instead, we want to give the user the opportunity to indicate which columns should be used instead, and adapt the operations automatically.
  • undo operations which are not the last ones to have been executed, with the ability to retain some of the operations executed afterwards if they are independent from the operation being undone. For instance, suppose you spent a lot of time reconciling a column (including manual reconciliation judgments) and then realize that you want to undo an operation you did before on an unrelated column.
  • re-fetch the external data obtained in a long-running operation (such as reconciliation, or fetching from URLs)
  • visualize histories with a graph-based structure, to understand better the structure of a data cleaning project.
  • and many more!
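
As a rough illustration of how the first and third items could build on this metadata, the sketch below checks whether two operations touch distinct columns and works out which later operations can be kept when undoing an earlier one. It is a conservative approximation written in Python, not the OpenRefine implementation: `Op`, `independent` and `split_history_for_undo` are hypothetical names, and only column-level dependencies are modeled (row-level effects are ignored).

```python
from typing import NamedTuple


class Op(NamedTuple):
    """Hypothetical operation summary: a name plus the columns it reads and writes."""
    name: str
    reads: frozenset
    writes: frozenset


def independent(a, b):
    # Two operations can run concurrently if neither writes to a column
    # that the other reads or writes.
    return not (a.writes & (b.reads | b.writes)) and not (b.writes & a.reads)


def split_history_for_undo(history, index):
    """Split the operations after history[index] into (retained, discarded)."""
    # Columns whose contents would change if the operation were undone.
    tainted = set(history[index].writes)
    retained, discarded = [], []
    for op in history[index + 1:]:
        if (op.reads | op.writes) & tainted:
            # This operation touched affected data, so its output is affected too.
            discarded.append(op)
            tainted |= op.writes
        else:
            retained.append(op)
    return retained, discarded


history = [
    Op("uppercase Notes", frozenset({"Notes"}), frozenset({"Notes"})),
    Op("reconcile City", frozenset({"City"}), frozenset({"City"})),
    Op("split Notes", frozenset({"Notes"}), frozenset({"Notes", "Notes 2"})),
]

# Undo the first operation while keeping whatever does not depend on it.
retained, discarded = split_history_for_undo(history, 0)
```

In this example the reconciliation of `City` is retained, because it neither reads nor writes `Notes`, while the later split of `Notes` has to be discarded along with the undone operation.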

More broadly, the vision behind this improved metadata for operations is to make it possible for users to define their own macros, by selecting operations they executed on a project and making them re-usable. Those macros could be made accessible from the menus of their own OpenRefine instance, or shared with others. This would be a sort of generalization of the existing starred expressions, in a sense.

So far I have been experimenting with metadata structures, primarily with the concurrency of long-running operations in mind. I will soon post more on this topic, including a demo of what it could look like. The goal of this experiment is to converge towards a particular format for this metadata and a way to introduce it in the code base. Because migrating each operation to expose this additional metadata takes time, it is worth having a clear idea of what is needed before starting.

In the coming month I plan to keep working on this operation dependency issue. I would also like to start testing the existing 4.0 branch with a couple of trusted users to gather some initial feedback, but how soon that happens depends on the hiring of the designer who will be involved in this testing. I anticipate I will also spend some more time on restructuring the branch and responding to reviews on the existing pull requests.


I have posted more details of my work on concurrent execution in this other thread: