Reproducibility project: November 2023 report

Here is the usual monthly summary of my progress on the reproducibility project.

I already posted an update about my work on concurrency in the dedicated thread. But this month, most of my effort went into rebasing and re-organizing my work in response to feedback from @tfmorris and @abbe98.

Given the scale of the rebasing needed, I have invested some time in building tooling to streamline the process by minimizing the conflicts that need to be resolved. I think it's a valuable investment that can be useful to others as well, whether in OpenRefine itself or in extensions, forks and so on, because it speeds up the process quite a bit.

This gave rise to essentially two pieces of tooling:

  • a way to quickly reformat the entire Java codebase. This is an operation I need to perform very often to avoid spurious formatting conflicts, especially because my work on the new architecture started before we introduced formatting checks in our CI. Running the formatter via Maven is quite slow, so I have built a small tool to do it faster and wrote a blog post describing the approach. Although I originally intended to use it as a git filter driver, I ended up not doing so; instead, it mostly serves as a building block for the second piece of tooling
  • a custom merge driver, to automatically resolve merge conflicts that consist only of import statements. Those make up a very significant proportion of the merge conflicts I have had to resolve when merging the master branch into the 4.0 branch. Git offers functionality to plug in custom logic for merging files, which gives the opportunity to improve on the generic algorithms it provides. My approach is roughly as follows:
    • given the three versions of a file to be merged (the two diverging ones and their common ancestor), I first reformat all three files using the utility mentioned above
    • I then call Git's own file merging algorithm via git merge-file
    • I post-process the output of this merge, parsing each conflict hunk and detecting if it consists only of imports
    • If it does, then I resolve the conflict by taking the combination of added and removed imports on both sides.
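To make the post-processing step concrete, here is a rough Python sketch of how import-only conflict hunks can be resolved. It assumes the merged file was produced with diff3-style conflict markers (e.g. git merge-file --diff3 -p ours base theirs); the function names are invented for illustration, and this is not the actual tool:

```python
import re

# A hunk side qualifies if every line is a Java import statement or blank.
IMPORT_LINE = re.compile(r"^(import\s+(static\s+)?[\w.*]+\s*;|\s*)$")

def is_import_only(lines):
    return all(IMPORT_LINE.match(l) for l in lines)

def resolve_import_hunks(merged_text):
    """Post-process diff3-style merge output (e.g. from
    `git merge-file --diff3 -p ours base theirs`): replace each conflict
    hunk that consists only of imports with the combination of the
    additions and removals made on both sides."""
    lines = merged_text.splitlines()
    out, i = [], 0
    while i < len(lines):
        if not lines[i].startswith("<<<<<<<"):
            out.append(lines[i])
            i += 1
            continue
        # Collect the three sides of the conflict hunk.
        ours, base, theirs = [], [], []
        j = i + 1
        for side, stop in ((ours, "|||||||"), (base, "======="), (theirs, ">>>>>>>")):
            while not lines[j].startswith(stop):
                side.append(lines[j])
                j += 1
            j += 1  # skip the marker line itself
        if all(is_import_only(s) for s in (ours, base, theirs)):
            # Keep base imports that neither side removed...
            resolved = [l for l in base if l in ours and l in theirs]
            # ...plus imports added by either side.
            for l in ours + theirs:
                if l not in base and l not in resolved and l.strip():
                    resolved.append(l)
            out.extend(resolved)
        else:
            out.extend(lines[i:j])  # not imports only: keep the conflict as-is
        i = j
    return "\n".join(out)
```

A hunk that mixes imports with other code is left untouched, so the usual manual resolution still applies there.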

The custom merge driver is tailored to OpenRefine, but it could likely be re-packaged as a generic merge driver for Java files, which would be useful to others.

To develop this tool, I had to patch git itself, because the git merge-file command produced results that were inferior to the ones generated by git merge (or git rebase, git cherry-pick). That is because this command was using an outdated diff algorithm and did not offer a way to switch to more modern variants. Thankfully, my patch seems to be on the right path to get merged and released, so I am hopeful that I can soon share the merge driver as something that can be used without having to patch git.

With this tooling, rebasing my work on top of a recent version of the master branch is more tractable: I have updated the first two pull requests of the series (#334 and #335) accordingly. The formatter isn't very easily shareable because it's a binary file that would need to be built for various architectures, but if there is interest I could likely set up some workflow to make binaries for major architectures.

I have attempted to rebase the rest of the work, but I am not convinced that rebasing my existing commit log onto that branch is the best approach. The structure of that commit history follows my own process of developing this new architecture: there is of course some logic to it, but it is not the fastest, most direct way to introduce the features into the code base. For instance, I introduced a dependency on Spark and removed it later on. One could well introduce the current system for pluggable execution backends without going through that. Similarly, I made iterative changes to the project serialization format as I discovered the need for them; I would instead introduce the latest format from the start. So I am thinking about changing my strategy: instead, I would take the current end state of the branch and decompose it into a much smaller set of commits. I am thinking about the following commit structure:

  • [to merge in 3.x] change the project creation utilities in tests, which currently use the CSV importer under the hood, so that they no longer depend on the CSV importer and instead accept an array of cell values supplied by the test itself. This lets us specify cell values more precisely (such as the distinction between null values and empty strings, or inserting reconciled cells) and decouples the testing utility from the CSV importer, making it possible to expose it as an independent Maven artifact. This Maven artifact can then be used to test other Maven modules in our code base or be relied on by extensions for their own testing needs (without having to copy those utilities over from our code base).
  • [to merge in 3.x] backport to 3.x all the tests I wrote for operations, importers and exporters, using a syntax that is ideally as independent as possible from the architecture, so that tests can easily be transferred between the two branches
  • [to merge in 3.x] fix the class name mapping mechanism
  • Changes in Maven modularization (basically #334 but with some more module splitting)
  • Migration from com.google.refine to org.openrefine (#335)
  • Switch to a different default workspace directory
  • Introduction of the Runner / Grid / ChangeData interfaces. This would already contain the latest interfaces and the implementation of the runners. This is bound to be a very big commit - I don't see how it can be meaningfully split into intermediate steps, but I am open to suggestions. (The intermediate steps I had in my own commit log were obtained by selectively migrating parts of the code, giving up on having a code base that compiles and passes the tests at all times). This big commit should have fairly minimal impact on the frontend though (although some APIs will change, such as pagination).
  • Remove operation commands, replacing them by the generic apply-operations command
  • Add a confirmation dialog to guard the erasure of project history
  • Make commands return meaningful HTTP status codes and let the frontend report an error if an error HTTP status code is returned
  • Resizeable columns (#5840)

There might be a few more changes worth pulling out of the big commit in the middle, either before or after it: whether that's doable needs to be investigated on a case-by-case basis, looking at the dependencies of each change and the value of reviewing it separately.

The aim with this new structure is to:

  • provide some guarantees about preserving the behaviour of existing operations, by showing that tests pass before and after the rewrite
  • provide a single migration step to extension developers, instead of leading them through a long-winded path of successive refactors
  • make it easier to keep maintaining both branches by simplifying the porting of tests from one branch to another
  • make review easier, by not having people review outdated architectural choices

Obviously there is a big question mark around how feasible it is to restructure the history like this, but to me this approach looks both more useful and more achievable. Let me know what you think!

In parallel with this restructuring work, I intend to keep working on features and on user testing with the designer, who is coming onboard this month.


Maybe my message above didn't make that clear, but I would really like to know if @tfmorris and @abbe98 would be happy with the commit structure I proposed above.

Because it's a lot of work to reorganize it as such, I would only do that if they are happy with this structure.

Maybe my message above didn't make that clear, but I would really like to know if @tfmorris and @abbe98 would be happy with the commit structure I proposed above.

I don't see any previous messages in this thread. I don't know if the original message didn't get forwarded from the forum or it got sent to my spam folder or what, but I'll log in to the forum and look for the message you want reviewed.

Tom

I think applying as much as possible to the 3.x branch and eliminating the various false starts which were later reversed are both useful approaches. I think the real question will be how big and reviewable the "big commit" in the middle is when all this is done, and I have no real way of judging that.

Similarly, I made some iterative changes to the project serialization format, as I discovered the need for it. I would instead introduce the latest format from the start.

Something else which might be useful in addition to the squashed commit is some design documentation which explains what the various changes are (if it's not obvious from the code).

  • Changes in Maven modularization (basically #334 but with some more module splitting)

I don't really understand what this buys us. Is it high enough value to offset the churn of the code base? Each shuffling of things increases the difficulty of tracking the provenance of real functional changes.

  • Introduction of the Runner / Grid / ChangeData interfaces. This would already contain the latest interfaces and the implementation of the runners. This is bound to be a very big commit - I don't see how it can be meaningfully split into intermediate steps, but I am open to suggestions. (The intermediate steps I had in my own commit log were obtained by selectively migrating parts of the code, giving up on having a code base that compiles and passes the tests at all times). This big commit should have fairly minimal impact on the frontend though (although some APIs will change, such as pagination).
  • Remove operation commands, replacing them by the generic apply-operations command

I don't understand this last bullet. Is it separate from the bullet above or part of the big mongo commit?

  • Add a confirmation dialog to guard the erasure of project history
  • Make commands return meaningful HTTP status codes and let the frontend report an error if an error HTTP status code is returned
  • Resizeable columns (#5840)

These last three seem like useful things for the current users and might be something we want to incorporate in the 3.x branch as well. I actually already cherry-picked your Ajax global error handler for some of my error handling cleanup.

Tom

Thanks for having a look!

I think the real question will be how big and reviewable the "big commit" in the middle is when all this is done, and I have no real way of judging that.

I think it's not going to be very insightful to read through the entire diff, because it's going to be quite a long diff and it will not feel incremental. For instance, take ColumnSplitOperation.java and compare its version in the master branch and in the 4.0 branch. Apart from the namespace change, most of the differences between those two files will happen in that big commit.

To summarize the differences verbally in this case: in 3.x, the operation executes by generating a Change object which contains the list of splits across the project. The extra columns needed to store those splits are created dynamically as they become needed (when the maximum number of splits increases). In 4.x, the operation is executed by applying a mapping function on each row. Because this mapping function is potentially applied lazily internally (meaning, only later when rows actually need to be computed to be shown onscreen, for instance), we cannot rely on this to update the list of columns (which needs to be computed immediately, when the operation is invoked). Therefore, a first pass on the grid is done when the operation is applied, via an "aggregator" which computes the maximum number of splits generated by the operation, from which we infer the new list of columns.
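To illustrate the two-pass structure described above, here is a small Python sketch (rows are plain dicts; all names are hypothetical and this is not OpenRefine's actual API): an eager aggregation pass determines the number of new columns up front, after which the row-wise mapping can be evaluated lazily by the runner.

```python
# Illustrative sketch of the 4.x-style column split: an eager aggregator
# pass computes the maximum number of splits (so the new column list is
# known as soon as the operation is applied), then a row-wise mapping
# fills the new columns, potentially lazily.

def apply_column_split(rows, column, separator):
    # Pass 1 (eager): aggregate the maximum number of parts, since the
    # column model must be updated when the operation is applied.
    n = max(len(row[column].split(separator)) for row in rows)

    # Pass 2 (potentially lazy): map each row, padding to n parts.
    def map_row(row):
        parts = row[column].split(separator)
        parts += [None] * (n - len(parts))
        new_row = {k: v for k, v in row.items() if k != column}
        for i, part in enumerate(parts):
            new_row[f"{column} {i + 1}"] = part
        return new_row

    new_columns = [f"{column} {i + 1}" for i in range(n)]
    return new_columns, map(map_row, rows)  # `map` is lazy, like the runner
```

The point of the sketch is only the split into two passes: the aggregation must run immediately because the column model is needed right away, while the per-row mapping can be deferred until rows are actually displayed.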

Personally I think it's not the sort of difference that is convenient to review just as a diff: I prefer to read the version before and the version after independently. As much as I'd like to present this in a more incremental fashion, with a series of smaller steps, I just don't see how it can be done for a change of this sort.

That being said, by backporting some of my testing changes in 3.x first (see first two bullet points in the list above), I am hopeful that the diffs on tests will be smaller (with most changes following a clear, mechanical pattern), which should help convince you that functionality was preserved.
The changes to the Cypress test suite will be minimal in any case.

Something else which might be useful in addition to the squashed commit is some design documentation which explains what the various changes are (if it's not obvious from the code).

I think that's indeed a much easier way to approach it. I have written a draft of an extension migration guide, and my hope is that it will also be helpful for reviewing the architectural changes from a higher-level perspective.
In particular, the section about migrating from in-memory project data storage to the runner architecture explains the changes that would be reflected in this big central commit. I have also drafted a description of the 4.0 architecture without reference to the 3.x architecture, but I suspect it is less useful for reviewing the change.

I would be really, really happy to get feedback on those texts and am keen to improve them as needed.

I don't really understand what this buys us. Is it high enough value to offset the churn of the code base? Each shuffling of things increases the difficulty of tracking the provenance of real functional changes.

For me, the main things this buys us are:

  • guaranteeing that the business logic (operations, facets, importers, exporters) does not depend on implementation-specific details of the Grid and Runner interfaces. This is helpful to make sure that the execution backend (which implements those interfaces) can really be swapped for something else, such as Spark or a stream-based alternative.
  • exposing test utilities in a standalone module, so that extensions can rely on it without having to copy over test utilities from our code base. That alone requires quite some modularization: it requires not just a separate refine-testing module, but also a split between refine-model (the core model classes such as Project, Row, Cell…) and refine-workflow (the operations, facets, importers, exporters…). That's because refine-testing needs to be able to refer to the core model classes but is used in the tests of operations, for instance.

I agree that moving those files has a cost, but I think that if it's properly isolated in a commit, it should be manageable from a review perspective. My earlier version of that commit indeed had some other small changes (such as dependency changes) mixed in, but I have fixed that since then (see this PR comment). It would become even more lightweight with the testing changes being backported to 3.x.

In any case it should be doable to rework the modularization (rebasing other changes on top) if we want to change the boundary between modules.

I don't understand this last bullet. Is it separate from the bullet above or part of the big mongo commit?

It is separate: each bullet point would be its own commit. By "remove operation commands, replacing them by the generic apply-operations command", I mean #5539.

These last three seem like useful things for the current users

I think (at least I hope) that is true of all the changes I made for 4.x, right? The goal is a better UX for the current users of the tool. If there are changes which you think don't serve our current users, I am keen to discuss that.

and might be something we want to incorporate in the 3.x branch as well. I actually already cherry-picked your Ajax global error handler for some of my error handling cleanup.

I am open to backporting the first and third one. For the second one I feel like this is pretty risky: the back-end side error handling is pretty bad, and I remember @Abbe98 had concerns about my approach on the front-end side.

I have started work on restructuring the commit log as sketched above, starting with the first bullet point: