Here is the usual monthly summary of my progress on the reproducibility project.
I already posted an update about my work on concurrency in the dedicated thread, but most of my time this month went into rebasing and reorganizing my work in response to feedback from @tfmorris and @abbe98.
Given the scale of the rebasing needed, I have invested some time in building tooling to optimize the process, minimizing the conflicts that need to be resolved. I think it's a valuable investment that can be useful to others as well, be it in OpenRefine itself or in extensions, forks and so on, because it does speed up the process quite a bit.
This gave rise to essentially two pieces of tooling:
- a way to quickly reformat the entire Java codebase. This is an operation I need to do very often to avoid spurious formatting conflicts, especially because my work on the new architecture started before we introduced formatting checks in our CI. Running the formatter via Maven is quite slow, so I have built a small tool to do that faster and wrote a blog post to describe the approach. Although I originally wanted to use this as a git filter driver, I don't actually do that: I mostly use it as a building block for the second piece of tooling.
- a custom merge driver, to automatically resolve merge conflicts which consist only of import statements. Those make up a very significant proportion of the merge conflicts I have had to solve when merging the master branch into the 4.0 branch. Git lets you plug in custom logic to merge files, which gives the opportunity to improve on the generic algorithms that it offers. My approach is roughly as follows:
  - given the three versions of a file to be merged together (the two diverging ones and their common ancestor), I first reformat all three files using the utility mentioned above
  - I then call Git's own file merging algorithm via git merge-file
  - I post-process the output of this merge, parsing each conflict hunk and detecting if it consists only of imports
  - if it does, then I resolve the conflict by taking the combination of added and removed imports on both sides.
The custom merge driver is tailored to OpenRefine, but it could probably be re-packaged as a generic merge driver for Java files, which might be useful to others.
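To make the post-processing step more concrete, here is a minimal Python sketch of resolving import-only conflicts. It is not the actual tool. It assumes the merge output uses diff3-style conflict markers (with the common ancestor between the ||||||| and ======= lines), and the function names and the choice to keep imports in order of appearance are hypothetical simplifications.

```python
def _is_import_or_blank(line):
    """True for blank lines and single-line Java import statements."""
    stripped = line.strip()
    return stripped == "" or (stripped.startswith("import ") and stripped.endswith(";"))


def _resolve(ours, base, theirs):
    """Merge the three sides of a hunk, or return None if it is not import-only."""
    if not all(_is_import_or_blank(line) for line in ours + base + theirs):
        return None
    # An import is dropped if either side removed it relative to the ancestor.
    removed = (set(base) - set(ours)) | (set(base) - set(theirs))
    merged = []
    for line in ours + theirs:
        # Keep surviving imports once, in order of first appearance (assumption).
        if line.strip() and line not in removed and line not in merged:
            merged.append(line)
    return merged


def resolve_import_conflicts(merged_text):
    """Replace import-only conflict hunks by their resolution; leave others as-is."""
    out = []
    hunk = None      # raw lines of the conflict being read, None outside a conflict
    sections = None  # (ours, base, theirs) content lines of that conflict
    part = 0         # index of the section currently being read
    for line in merged_text.splitlines():
        if hunk is None:
            if line.startswith("<<<<<<<"):
                hunk, sections, part = [line], ([], [], []), 0
            else:
                out.append(line)
            continue
        hunk.append(line)
        if line.startswith("|||||||") and part == 0:
            part = 1
        elif line.startswith("=======") and part == 1:
            part = 2
        elif line.startswith(">>>>>>>") and part == 2:
            resolved = _resolve(*sections)
            out.extend(hunk if resolved is None else resolved)
            hunk = None
        else:
            sections[part].append(line)
    if hunk is not None:
        out.extend(hunk)  # unterminated or non-diff3 hunk: emit unchanged
    return "\n".join(out) + ("\n" if merged_text.endswith("\n") else "")
```

Conflicts that touch anything other than imports are left in place with their markers, so they still need to be resolved by hand.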
To develop this tool, I had to patch git itself, because the git merge-file command produced results that were inferior to the ones generated by git merge (or git rebase, git cherry-pick). That is because this command was using an outdated diff algorithm and did not offer a way to switch to more modern variants. Thankfully, my patch seems to be on the right path to get merged and released, so I am hopeful that I can soon share the merge driver as something that can be used without having to patch git.
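For readers unfamiliar with merge drivers, here is a minimal Python sketch of how such a driver can be wired to git merge-file. It is not the actual tool: the driver name java-imports, the script name merge_driver.py and the helpers merge_command and run_driver are hypothetical. Git passes the driver the paths of the ancestor (%O), current (%A) and other (%B) versions, and expects the result to be written to the current file, with a non-zero exit status if conflicts remain.

```python
import subprocess
from pathlib import Path

# Registration (hypothetical driver and script names):
#   .gitattributes:   *.java merge=java-imports
#   git config merge.java-imports.driver "python3 merge_driver.py %O %A %B"


def merge_command(current, base, other):
    """Build the git merge-file call: print the result (-p) with diff3-style markers."""
    return ["git", "merge-file", "-p", "--diff3", str(current), str(base), str(other)]


def run_driver(base, current, other, post_process=lambda text: text, run=subprocess.run):
    """Merge the three files, post-process the output and write it to `current`.

    Returns the exit status git expects from a merge driver:
    0 for a clean merge, 1 if conflict markers remain.
    """
    proc = run(merge_command(current, base, other), capture_output=True, text=True)
    # git merge-file exits with the number of conflicts (at most 127);
    # anything else signals an error.
    if not 0 <= proc.returncode <= 127:
        raise RuntimeError(f"git merge-file failed: {proc.stderr}")
    merged = post_process(proc.stdout)
    Path(current).write_text(merged)
    has_conflicts = any(line.startswith("<<<<<<<") for line in merged.splitlines())
    return 1 if has_conflicts else 0
```

A script used as the driver would end with something like sys.exit(run_driver(*sys.argv[1:4])), passing a post_process function that resolves the conflicts it knows how to handle.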
With this tooling, rebasing my work on top of a recent version of the master branch is more tractable: I have updated the first two pull requests of the series (#334 and #335) accordingly. The formatter isn't very easily shareable because it's a binary file that would need to be built for various architectures, but if there is interest I could likely set up some workflow to make binaries for major architectures.
I have attempted to rebase the rest of the work, but I am not convinced that rebasing my existing commit log on that branch is the best approach. The structure of that commit history follows my own process for developing this new architecture: there is of course some logic to it, but it is not the fastest, most direct way to introduce the features in the code base. For instance, I introduced a dependency on Spark and removed it later on. One could well introduce the current system for pluggable execution backends without going through that. Similarly, I made iterative changes to the project serialization format as I discovered the need for them. I would instead introduce the latest format from the start. So I am thinking about changing my strategy, by instead taking the current end state of the branch and decomposing it into a much smaller set of commits. I am thinking about the following commit structure:
- [to merge in 3.x] change the project creation utilities in tests, which currently use the CSV importer under the hood, so that they no longer depend on the CSV importer and instead take an array of cell values supplied by the test itself. This lets us specify cell values more precisely (such as the distinction between null values and empty strings, or inserting reconciled cells) and decouples the testing utility from the CSV importer, making it possible to expose it as an independent Maven artifact. This Maven artifact can then be used to test other Maven modules in our code base or be relied on by extensions for their own testing needs (without having to copy those utilities over from our code base).
- [to merge in 3.x] backport to 3.x all the tests I wrote for operations, importers and exporters, using a syntax that is ideally as independent as possible from the architecture, so that tests can easily be transferred between the two branches
- [to merge in 3.x] fix the class name mapping mechanism
- Changes in Maven modularization (basically #334 but with some more module splitting)
- Migration from com.google.refine to org.openrefine (#335)
- Switch to a different default workspace directory
- Introduction of the Runner / Grid / ChangeData interfaces. This would already contain the latest interfaces and the implementation of the runners. This is bound to be a very big commit - I don't see how it can be meaningfully split into intermediate steps, but I am open to suggestions. (The intermediate steps I had in my own commit log were obtained by selectively migrating parts of the code, giving up on having a code base that compiles and passes the tests at all times). This big commit should have fairly minimal impact on the frontend though (although some APIs will change, such as pagination).
- Remove operation commands, replacing them by the generic apply-operations command
- Add a confirmation dialog to guard the erasure of project history
- Make commands return meaningful HTTP status codes and let the frontend report an error if an error HTTP status code is returned
- Resizeable columns (#5840)
There might be a few more changes that are worth pulling out of the big commit in the middle, either before or after it: whether that's doable needs to be investigated on a case-by-case basis, looking at the dependencies of each change and the value of reviewing it separately.
The aim with this new structure is to:
- provide some guarantees about preserving the behaviour of existing operations, by showing that tests pass before and after the rewrite
- provide a single migration step to extension developers instead of leading them through a long-winded path of successive refactors
- make it easier to keep maintaining both branches by simplifying the porting of tests from one branch to another
- make review easier, by not having people review outdated architectural choices
Obviously there is a big question mark around how doable it is to restructure the history like this, but to me this approach looks both more useful and more feasible than rebasing the existing history. Let me know what you think!
In parallel with this restructuring work, I intend to keep working on features and user testing with the designer, who is coming onboard this month.