Work plan for the reproducibility improvements project

In April, I worked on stabilizing the code base and on optimizing the loading of large files.
Specifically:

Records mode

I have reduced the need for counting the number of records in the project by changing the criterion under which we turn on records mode automatically. The new criterion should be more faithful to the project structure, meaning there is less risk of enabling records mode on a project which is not meant to use it (#5661).

In addition, I have made it easier to compute the number of records at import time, during the initial pass over the dataset (which is required for other reasons, such as counting the number of rows and ensuring all rows have the same number of columns).

Overall, this means fewer passes over the entire dataset.
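For illustration, here is a rough sketch (with simplified, made-up types rather than the actual OpenRefine classes) of how rows and records can be counted in the same pass, assuming the usual records-mode convention that a record starts on a row whose key column is non-blank:

    import java.util.List;

    // Rough sketch only: simplified types, not the actual OpenRefine classes.
    // Counts rows and records in a single pass over the dataset. We assume the
    // usual records-mode convention that a record starts on a row whose key
    // column (column 0 here) is non-blank; the real implementation also handles
    // edge cases such as leading rows with a blank key column.
    public final class ImportCounts {

        public static long[] countRowsAndRecords(Iterable<List<String>> rows) {
            long rowCount = 0;
            long recordCount = 0;
            for (List<String> row : rows) {
                rowCount++;
                String keyCell = row.isEmpty() ? null : row.get(0);
                if (keyCell != null && !keyCell.isBlank()) {
                    recordCount++;
                }
            }
            // Returned as {rowCount, recordCount} to keep the sketch short.
            return new long[] { rowCount, recordCount };
        }
    }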

Better streaming from partitions

To enable parallelization of various computations (such as computing the state of facets), the grid is generally split into multiple partitions stored as distinct files on disk. The grid is read in a streaming fashion from those files.

Until now, this streaming was represented as a Java 8 Stream. I made that choice because I needed an API which allowed:

  • constructing pipelines on a collection with lazy evaluation, meaning that the collection is only actually iterated over when necessary;
  • supporting a closing mechanism, to release the underlying resources when a stream is no longer needed.

In theory, that is something Java 8 Streams provide. In practice, the internal architecture they are built on falls short on both fronts: it introduces spurious caching of intermediate results, which slows down iteration, and it makes it hard to reliably close the underlying resources when the stream is no longer needed.

So I have migrated away from such streams, back to a custom interface, CloseableIterator, which combines the Iterator<T> and AutoCloseable interfaces. Plenty of projects define such an interface for themselves: sadly, it does not seem to be readily available in any standalone library as far as I can tell. To make this interface easier to use, I have based it on the Iterator interface offered by the Vavr library, which comes with plenty of methods to do sensible things on streams (including the ones above that are missing on Java 8 Streams). This library seems well established and it feels okay to depend on it, but if we wanted to avoid the dependency, it would not be too hard to re-implement those utility methods ourselves.
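For the curious, the shape of that interface is roughly the following (a simplified sketch based on plain java.util.Iterator; the actual interface extends Vavr's io.vavr.collection.Iterator to inherit its richer API):

    import java.util.Iterator;

    // Simplified sketch of an iterator that also owns the resources backing it,
    // such as an open partition file. The actual interface extends Vavr's
    // Iterator instead, which brings map, filter, zip and many other methods.
    public interface CloseableIterator<T> extends Iterator<T>, AutoCloseable {

        // Narrowing the exception type of close() lets callers use
        // try-with-resources without having to catch checked exceptions.
        @Override
        void close();
    }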

Given how widespread CloseableIterator / CloseableIterable interfaces are, I think there is a case for including this in Vavr itself - I will approach the maintainers to see if they are interested in that.

To sum up, the benefits of this move are:

  • A noticeable performance gain for operations which store change data (such as reconciliation or fetching URLs), since before this move their use slowed down iteration over the grid due to the spurious caching issue mentioned above;
  • Better closing of files when they are no longer needed;
  • More readable code, as the richer API provided by Vavr avoids writing custom iterators in many places.
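To give an idea of the second point, here is a hypothetical consumer written against the sketch above: try-with-resources guarantees that the underlying partition files are released as soon as iteration ends, whether normally or through an exception.

    // Hypothetical consumer, using the CloseableIterator sketched above.
    // The iterator (and the partition files behind it) is closed as soon as
    // the method returns, even if an exception is thrown mid-iteration.
    static <T> long countNonNull(CloseableIterator<T> items) {
        try (items) {
            long count = 0;
            while (items.hasNext()) {
                if (items.next() != null) {
                    count++;
                }
            }
            return count;
        }
    }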

Better importing infrastructure

There are still many formats for which we are only able to import a file in OpenRefine if the original file can be loaded entirely in memory. The reason for this is that the Runner interface I designed when abstracting away from Spark in 2020 only made it possible to parse files efficiently when they were line-based, a restriction motivated by Spark's own API. This meant that efficient import could only happen when importing CSV/TSV files with certain importing options (or with the line-based / fixed-width importers).

I realized that we could actually make it possible to import any iterable collection of rows. This means that any file we can stream from can be loaded with little memory. The downside compared to Spark's API is that we do not take advantage of splitting the file: at import time, we are not able to parallelize reading the dataset. But this is only true at import time: once the dataset is imported as an OpenRefine project, we can parallelize operations again. Therefore I have added methods on the Runner interface to create a Grid from a CloseableIterable of rows (so that the underlying files can be closed appropriately).
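As a sketch of the idea (the names and signatures below are made up and differ from the actual OpenRefine interfaces), such a method looks roughly like this:

    import java.util.Iterator;

    // Illustrative sketch only: the real Runner, Grid, Row and ColumnModel
    // interfaces live in the OpenRefine code base and differ in their details.
    interface Row {}
    interface ColumnModel {}
    interface Grid {}

    interface CloseableIterator<T> extends Iterator<T>, AutoCloseable {
        @Override
        void close();
    }

    // An Iterable whose iterators own a resource, typically an open file:
    // each call to iterator() can reopen the file and stream rows from it.
    interface CloseableIterable<T> extends Iterable<T> {
        @Override
        CloseableIterator<T> iterator();
    }

    interface Runner {
        // Build a Grid from a lazily streamed source of rows: the importer never
        // needs to hold the whole file in memory, and the runner can close the
        // underlying file once it is done reading from it.
        Grid gridFromIterable(ColumnModel columns, CloseableIterable<Row> rows);
    }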

A side effect of that is that my lousy LineReader class, which I had implemented to support splitting text files in the default runner implementation, is no longer used at all for CSV/TSV import, and that is a big performance win too.

Concretely this means:

  • immediately enabling efficient import of all CSV/TSVs regardless of the import options
  • making it possible in the future to enable efficient import of Excel, JSON and XML files. For ODS files I do not think the underlying library would support it.
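To illustrate what a streamed row source looks like for a text-based format, here is a rough, stand-alone sketch for CSV (with a naive split on commas; the actual importers use a proper CSV parser and the OpenRefine row types):

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.UncheckedIOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.Arrays;
    import java.util.Iterator;
    import java.util.List;
    import java.util.NoSuchElementException;

    // Rough sketch: stream the rows of a CSV file one by one, closing the file
    // when iteration is done. The "parsing" is a naive split on commas; the
    // actual importers use a proper CSV parser and the OpenRefine row types.
    public final class CsvRowIterator implements Iterator<List<String>>, AutoCloseable {

        private final BufferedReader reader;
        private String nextLine;

        public CsvRowIterator(Path file) throws IOException {
            this.reader = Files.newBufferedReader(file);
            this.nextLine = reader.readLine();
        }

        @Override
        public boolean hasNext() {
            return nextLine != null;
        }

        @Override
        public List<String> next() {
            if (nextLine == null) {
                throw new NoSuchElementException();
            }
            String line = nextLine;
            try {
                nextLine = reader.readLine();
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
            return Arrays.asList(line.split(",", -1));
        }

        @Override
        public void close() throws IOException {
            reader.close();
        }
    }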

More optimizations and real-world testing

I have made various smaller optimizations in other places, generally motivated by manual testing on various large datasets I could find (for instance on open data portals). I also had a testing session with @thadguidry, where he loaded datasets of his own into the prototype. I am reaching out to people who filed issues about large dataset handling to get their feedback on this new version.

Documentation

I have continued writing more documentation about this architecture, in three directions:

  • general documentation of the structure of the application. I still struggle to document parts which do not feel completely done yet, so that is holding me back a bit. For instance I think there is no point documenting the memory management strategy at the moment given that my plans about it are still evolving.
  • documentation about migrating extensions. I think I want it to be more example-driven than previous migration instructions, and I plan to migrate some third-party extensions myself and write down what I do as I go along.
  • documentation about writing new extensions. I think given the scale of the differences, it will often be easier for extension developers to refer to new, current documentation about how to write an extension from scratch, even if they want to migrate an existing one. Also, I want to bake in more information about how to set up your development environment to develop an extension, how to set up testing for it (including Cypress tests), and how to keep it compatible with newer versions of OpenRefine.

What’s next

In May, I am planning to:

  • keep stabilizing the current prototype and aim to publish a 4.0-alpha2 release to ease external testing
  • keep writing documentation
  • improve the UX for the visualization of partial change data, as it feels necessary before a new release. I have an exciting prototype to share soon - stay tuned!
  • take some time to have a good think about the architecture to use to expose column dependencies of operations. There are various conflicting requirements I want to ponder and experiment with.