Work plan for the reproducibility improvements project

From afar, the new project to improve OpenRefine’s reproducibility might look like a fairly chaotic selection of issues with unclear links between them.

The common thread between all of them is that they require changes in the way OpenRefine represents the history of a project. It is worth tackling them together because making changes to the way we represent project history is costly:

  • it is generally difficult to make those changes while staying compatible with previous versions of OpenRefine and with existing extensions
  • it is likely to introduce data loss or corruption issues
  • it involves adaptations to a large number of functionalities (all the ways you can make changes to the grid)

Therefore I am aiming for a new history format which addresses as many of those UX issues as possible. Designing it upfront and implementing it in one go would be a headache, so I am trying to make this process as iterative as possible. Among the use cases listed in the project tracker so far, there are many dependencies and similarities. I have attempted to draw a dependency graph of those tasks, so as to understand better where to start:

The nodes with a thicker border are meant to represent internal changes in history representation, and the others are user-facing changes that are enabled by those internal changes. The dependencies are rather coarse-grained and debatable - the graph surely already reflects my own interpretation of those user requirements and of how I want to solve them.

Generally speaking, my goal is to have things testable as often as possible, so I plan to work on user-facing issues soon after they are enabled by an internal change. I plan to work on those changes in the 4.0 branch, because the serialization format for projects in that branch is still a work in progress, so our hands are not tied by compatibility with previous versions.

Let me know how that sounds and if you have any suggestions!

Hmm, I’ve been thinking as I was commenting on some issues you had in the project. What if we flipped things around BOLDLY and went schemaless and append-only, tracking cell changes completely with a change id or ref key, as Uber did to version the cells for a given row key and column? It’s a rather ingenious system based on Google’s approaches in Bigtable: Designing Schemaless, Uber Engineering’s Scalable Datastore Using MySQL | Uber Blog
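
Roughly, the idea would look something like this (just a sketch with made-up names, not a proposal for OpenRefine’s actual data model): every edit appends a new cell version addressed by row key, column and change id, and the current grid is simply the latest version of each cell.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Rough sketch of the append-only cell-versioning idea (illustrative only).
// Every edit appends a new version of a cell, addressed by row key, column
// name and a change id; undo amounts to reading an earlier change id.
class AppendOnlyCellStore {

    static final class CellVersion {
        final long changeId;
        final String value;

        CellVersion(long changeId, String value) {
            this.changeId = changeId;
            this.value = value;
        }
    }

    // (rowKey, columnName) -> ordered versions; entries are never mutated
    private final Map<String, List<CellVersion>> log = new HashMap<>();

    private static String key(String rowKey, String columnName) {
        return rowKey + "\u0000" + columnName;
    }

    void append(String rowKey, String columnName, long changeId, String value) {
        log.computeIfAbsent(key(rowKey, columnName), k -> new ArrayList<>())
           .add(new CellVersion(changeId, value));
    }

    // Current value = last appended version.
    Optional<String> latest(String rowKey, String columnName) {
        List<CellVersion> versions = log.get(key(rowKey, columnName));
        if (versions == null || versions.isEmpty()) {
            return Optional.empty();
        }
        return Optional.of(versions.get(versions.size() - 1).value);
    }
}
```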

Here is a progress report on this front, for January 2023.

On the development side of things, I have started to tackle the first tasks suggested by the dependency graph above. Specifically, I have worked on:

  • Unified HTTP API to apply operations (#5539)
  • Warn users of history entry loss (#3184)
  • Row/record preservation metadata in the backend (#5561)
  • Preservation of pagination for the undo functionality (#572)

My goal with this was to strike a balance between internal refactoring and user-facing features. Even if the latter are rather low-hanging fruit and not necessarily very representative of the benefits the project will bring in the long term, it feels important to me to have something to show regularly. Those user-facing changes were also a good occasion to get insightful feedback from other contributors, for instance suggestions to support branching project histories, which could totally be in scope for this project. I cannot guarantee that I will be able to retain this way of working at all times: there will definitely be phases where I need to work longer on some internal architecture, without user-facing changes being immediately available. I also acknowledge @tfmorris’ encouragement not to make piecemeal changes and to aim for the end state more directly.

My next task is the support for partial change data serialization. Broadly speaking, this is about making it possible for OpenRefine to store intermediate results of long-running operations such as reconciliation, so that we can resume an interrupted reconciliation operation (#87) or show the first few reconciled rows before the entire project data is processed (#5541). Those are features where I will need more design input from the community - I will open dedicated forum threads for that.
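
To give a rough idea of what this could involve (purely illustrative - the actual design is precisely what I want to gather input on), the mechanism boils down to appending results to disk as they are computed, so that an interrupted run can pick up from the last persisted row:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.function.Function;

// Very rough sketch (not OpenRefine's actual design): results of a
// long-running operation are appended to disk as they are computed, so an
// interrupted run can resume from the last persisted row and the UI can
// already show the rows processed so far.
class PartialChangeDataWriter {

    static void run(Path changeDataFile, long totalRows,
                    Function<Long, String> expensiveOperation) throws IOException {
        // Resume after the rows that were already serialized in a previous run.
        long alreadyDone = Files.exists(changeDataFile)
                ? Files.readAllLines(changeDataFile).size()
                : 0;

        for (long rowId = alreadyDone; rowId < totalRows; rowId++) {
            String result = expensiveOperation.apply(rowId); // e.g. reconciliation
            Files.writeString(changeDataFile,
                    rowId + "\t" + result + "\n",
                    StandardCharsets.UTF_8,
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        }
    }
}
```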

I have also been working on documentation. I want to write a good overview of the new architecture and a migration guide for extension developers, but before I get to that, it felt necessary to do a big clean-up of the existing technical documentation. This is long overdue, since the technical reference on openrefine.org was still a mostly disorganized copy of some wiki pages, with fairly outdated content. Given that we will likely have prospective interns coming to contribute, it also feels timely to get our house in order before that. I have also worked on improving the Javadoc for the new architecture. The goal is to publish Javadocs we can be proud of (#2311), which can serve as a complement to the high-level architectural overview in the technical reference.

Let me know if you wish for different priorities or ways of working. And get ready for more exciting user-facing changes, which should hopefully appear soon for your review :slight_smile:

Quick update on this. I have updated the task plan above to mark the tasks which are complete - meaning that there is a first implementation of them, not that they are final in any way.

More recent progress updates can be found in this thread: Partial results of long-running operations

This month I am not working on new features but rather stabilizing the code base and hunting for performance issues. More about that soon!

In April, I worked on stabilizing the code base and optimizing the loading of large files.
Specifically:

Records mode

I have reduced the need for counting the number of records in the project, by changing the criterion under which we turn on the records mode automatically. The new criterion should be more faithful to the project structure, meaning that there is less risk of turning on the records mode for a project which is not meant to use it (#5661).

In addition, I have made it easier to compute the number of records at import time, during the initial pass over the dataset (which is required for other reasons, such as counting the number of rows and ensuring all rows have the same number of columns).
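
As an illustration (a rough sketch, not our actual implementation), counting records during that pass boils down to counting the rows whose key (first) cell is non-blank, since each of those opens a new record:

```java
import java.util.List;

// Illustrative sketch: records can be counted during the same initial pass
// that counts rows and checks column counts, because a new record starts at
// every row whose key (first) cell is non-blank.
class RecordCounter {

    static long countRecords(Iterable<List<String>> rows) {
        long recordCount = 0;
        for (List<String> row : rows) {
            String keyCell = row.isEmpty() ? null : row.get(0);
            if (keyCell != null && !keyCell.isBlank()) {
                recordCount++;
            }
        }
        return recordCount;
    }
}
```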

Overall, this means fewer passes over the entire dataset.

Better streaming from partitions

To enable parallelization of various computations (such as computing the state of facets), the grid is generally split into multiple partitions stored as distinct files on disk. The grid is read in a streaming fashion from those files.

This streaming was until now represented as a Java 8 Stream. The reason why I made this choice was that I needed an API which allowed:

  • constructing pipelines on a collection with lazy evaluation, meaning that the collections are only actually iterated over when necessary;
  • supporting a closing mechanism, to enable releasing underlying resources when a stream is no longer needed.

In theory, that is something Java 8 Streams provide. But in reality, the internal architecture they are built on means that neither guarantee holds reliably in practice: iterating over a stream can trigger spurious caching of elements, and the closing of the underlying resources is not always propagated through the pipeline.
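
For reference, here is what those two requirements look like in theory with plain Java 8 Streams (a minimal, hypothetical file-reading example, not OpenRefine code): Files.lines is evaluated lazily and try-with-resources releases the file handle.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

class StreamExample {

    // Lazy pipeline + explicit closing with a Java 8 Stream: Files.lines only
    // reads the file as the terminal operation pulls elements, and the
    // try-with-resources block releases the file handle afterwards.
    static long countNonEmptyLines(Path partition) throws IOException {
        try (Stream<String> lines = Files.lines(partition)) {
            return lines
                    .map(String::trim)
                    .filter(line -> !line.isEmpty())
                    .count();
        }
    }
}
```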

So I have migrated away from such streams, back to a custom interface, CloseableIterator, which is the combination of the Iterator<T> and AutoCloseable interfaces. There are tons of projects defining such an interface; sadly, it does not seem to be readily available in any standalone library as far as I can tell. To make this interface easier to use, I have based it on the Iterator interface offered by the Vavr library, which comes with plenty of methods to do sensible things on streams (including the capabilities above that are missing from Java streams). This library seems to be pretty well established and it feels okay to depend on it, but if we wanted to avoid the dependency, it would not be too hard to re-implement those utility methods on our own.
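
To make this concrete, here is a minimal sketch of the idea. It is deliberately simplified: it extends java.util.Iterator rather than Vavr’s richer Iterator, and the LineIterator below is a hypothetical example of streaming lines from a partition file, not our actual implementation.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Iterator;
import java.util.NoSuchElementException;

// Simplified sketch: the real interface is based on Vavr's richer Iterator,
// but the core idea is just "an Iterator that can also be closed".
interface CloseableIterator<T> extends Iterator<T>, AutoCloseable {
    @Override
    void close(); // no checked exception, so try-with-resources stays tidy
}

// Hypothetical example: streaming lines from a partition file and releasing
// the file handle once the iterator is closed.
class LineIterator implements CloseableIterator<String> {
    private final BufferedReader reader;
    private String nextLine;

    LineIterator(Path partition) throws IOException {
        reader = Files.newBufferedReader(partition);
        nextLine = reader.readLine();
    }

    @Override
    public boolean hasNext() {
        return nextLine != null;
    }

    @Override
    public String next() {
        if (nextLine == null) {
            throw new NoSuchElementException();
        }
        String current = nextLine;
        try {
            nextLine = reader.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return current;
    }

    @Override
    public void close() {
        try {
            reader.close();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```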

Given how widespread CloseableIterator / CloseableIterable interfaces are, I think there is a case for including this in Vavr itself - I will approach the maintainers to see if they are interested in that.

To sum up, the benefits of this move are:

  • Noticeable performance gain for operations which store change data (such as reconciliation, fetching URLs…), as before this move their use would slow down iteration over the grid, due to the spurious caching issue mentioned above;
  • Better closing of files when they are no longer needed;
  • More readable code, by avoiding hand-written custom iterators in many places, thanks to the richer API provided by Vavr (see the short example below)
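
As a small illustration of that last point, a pipeline like the following (a hypothetical example, not OpenRefine code) needs no hand-written Iterator subclass:

```java
import io.vavr.collection.Iterator;

import java.util.Arrays;
import java.util.List;

class VavrIteratorExample {

    public static void main(String[] args) {
        List<String> cells = Arrays.asList("a", "", "b", "c", "");

        // Lazily evaluated pipeline with no custom Iterator implementation:
        // drop blank cells, upper-case the rest, keep the first two.
        Iterator<String> result = Iterator.ofAll(cells)
                .filter(cell -> !cell.isEmpty())
                .map(String::toUpperCase)
                .take(2);

        result.forEach(System.out::println); // prints A then B
    }
}
```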

Better importing infrastructure

There are still many formats for which we are only able to import a file into OpenRefine if the original file can be loaded entirely in memory. The reason for this is that the Runner interface I designed when abstracting away from Spark in 2020 only made it possible to parse files efficiently when they were line-based, a restriction motivated by Spark’s own API. This meant that efficient import could only happen when importing CSV/TSV files with certain import options (or with the line-based / fixed-width importers).

I realized that we could actually make it possible to import any iterable collection of rows. This means that any file that we can stream from can be loaded without much memory. The downside compared to Spark’s API is that we do not take advantage of splitting the file: at import time, we are not able to parallelize reading the dataset. But this is only true at import time: once the dataset is imported as an OpenRefine grid, we can parallelize operations again. Therefore I have added methods on the Runner interface to create a Grid from a CloseableIterable of rows (so that the underlying files can be closed appropriately).
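
To give an idea of the shape of this change, here is a heavily simplified, hypothetical sketch - the names and signatures are illustrative and do not match the actual Runner API:

```java
import java.util.Iterator;
import java.util.List;

// Heavily simplified, hypothetical sketch: the names and signatures below
// are illustrative and do not reflect OpenRefine's actual Runner interface.

// Same shape as the CloseableIterator sketched earlier.
interface CloseableIterator<T> extends Iterator<T>, AutoCloseable {
    @Override
    void close();
}

// An Iterable whose iterators know how to release the underlying files.
interface CloseableIterable<T> extends Iterable<T> {
    @Override
    CloseableIterator<T> iterator();
}

interface Row {
    List<String> cells();
}

interface Grid {
    long rowCount();
}

interface Runner {
    // Any importer that can stream rows one by one (CSV/TSV, Excel, JSON,
    // XML...) can feed this method without loading the whole file in memory.
    Grid gridFromIterable(CloseableIterable<Row> rows, List<String> columnNames);
}
```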

A side effect of that is that my lousy LineReader class, which I had implemented to support splitting text files in the default runner implementation, is not used anymore at all for CSV/TSV import - and that is a big performance win too.

Concretely this means:

  • immediately enabling efficient import of all CSV/TSVs regardless of the import options
  • making it possible in the future to enable efficient import of Excel, JSON and XML files. For ODS files I do not think the underlying library would support it.

More optimizations and real-world testing

I have made various smaller optimizations in other places, generally motivated by manual testing on various large datasets I could find (for instance on open data portals). I have had a testing session with @thadguidry where he loaded datasets of his own in the prototype. I am reaching out to people who filed issues about large dataset handling to get their feedback on this new version.

Documentation

I have continued writing more documentation about this architecture, in three directions:

  • general documentation of the structure of the application. I still struggle to document parts which do not feel completely done yet, so that is holding me back a bit. For instance I think there is no point documenting the memory management strategy at the moment given that my plans about it are still evolving.
  • documentation about migrating extensions. I think I want it to be more example-driven than previous migration instructions, and I plan to migrate some third-party extensions myself and write down what I do as I go along.
  • documentation about writing new extensions. Given the scale of the differences, I think it will often be easier for extension developers to refer to new, up-to-date documentation about how to write an extension from scratch, even if they want to migrate an existing one. Also, I want to bake in more information about how to set up your development environment to develop an extension, how to set up testing for it (including Cypress tests), and how to keep it compatible with newer versions of OpenRefine.

What’s next

In May, I am planning to:

  • keep stabilizing the current prototype and aim to publish a 4.0-alpha2 release to ease external testing
  • keep writing documentation
  • improve the UX for the visualization of partial change data, because it feels sort of necessary before a new release. I have an exciting prototype to share soon - stay tuned!
  • take some time to have a good think about the architecture to use to expose column dependencies of operations. There are various conflicting requirements I want to ponder and experiment with.

And here is a very quick video to demonstrate some effects of that work:

This issue summarizes which operations can be run on datasets which do not fit in RAM:

If you want to try it out, we have some snapshots available for that:

or by running it manually from the 4.0 branch.