Work plan for the reproducibility improvements project

antonin_d · January 15, 2023, 10:47pm

From afar, the new project to improve OpenRefine's reproducibility might look like a fairly chaotic selection of issues with unclear links between them.

The common thread between all of them is that they require changes in the way OpenRefine represents the history of a project. It is worth tackling them together because making changes to the way we represent project history is costly:

it is generally difficult to make those changes while staying compatible with previous versions of OpenRefine and with existing extensions
it is likely to introduce data loss or corruption issues
it involves adaptations to a large number of functionalities (all the ways you can make changes to the grid)

Therefore I am aiming for a new history format which addresses as many of those UX issues as possible. Designing it upfront and implementing it in one go would be a headache, so I am trying to make this process as iterative as possible. Among the use cases listed in the project tracker so far, there are many dependencies and similarities. I have attempted to draw a dependency graph of those tasks, so as to understand better where to start:

The nodes with a thicker border are meant to represent internal changes in history representation, and the others are user-facing changes that are enabled by those internal changes. The dependencies are rather coarse-grained and debatable - they surely stem from my own interpretation of those user requirements and of how I want to address them.

Generally speaking, my goal is to have things testable as often as possible, so I plan to work on user-facing issues soon after they are enabled by an internal change. I plan to work on those changes in the 4.0 branch, because the serialization format for projects in that branch is still a work in progress, so our hands are not tied by compatibility with previous versions.

Let me know how that sounds and if you have any suggestions!

thadguidry · January 16, 2023, 9:12am

Hmm, I’ve been thinking as I was commenting on some issues you had in the project. What if we flipped things around BOLDLY. Go for schemaless append-only and track cell changes completely with a change id or ref key as Uber did to version the cells for a given row key and column? It’s a rather ingenious system based on Google’s approaches in Bigtable. Designing Schemaless, Uber Engineering’s Scalable Datastore Using MySQL | Uber Blog

antonin_d · February 3, 2023, 10:29am

Here is a progress report on this front, for January 2023.

On the development side of things, I have started to tackle the first tasks suggested by the dependency graph above. Specifically, I have worked on:

Unified HTTP API to apply operations (#5539)
Warn users of history entry loss (#3184)
Row/record preservation metadata in the backend (#5561)
Preservation of pagination for the undo functionality (#572)

My goal with this was to strike a balance between internal refactoring and user-facing features. Even if the latter are rather low-hanging fruit and not necessarily very representative of the benefits the project will bring in the long term, it feels important to me to have something to show regularly. Those user-facing changes were also a good occasion to get insightful feedback from other contributors, for instance with suggestions of support for branching project histories, which could totally be in scope for this project. I cannot guarantee that I will be able to retain this way of working at all times: there will definitely be phases where I need to work longer on some internal architecture, without user-facing changes being immediately available. I also acknowledge @tfmorris’ encouragement not to make piecemeal changes and to aim for the end state more directly.

My next task is the support for partial change data serialization. Broadly speaking, this is about making it possible for OpenRefine to store intermediate results of long-running operations such as reconciliation, such that we can resume an interrupted reconciliation operation (#87) or show the first few reconciled rows before the entire project data is processed (#5541). Those are features where I will need more design input from the community - I will open specific forum threads for that.

I have also been working on documentation. I want to write a good overview of the new architecture and a migration guide for extension developers, but before I get to that, it felt necessary to do a big clean-up of the existing technical documentation. This is long overdue since the technical reference on openrefine.org was still a mostly disorganized copy of some wiki pages, with fairly outdated content. Given that we will likely have prospective interns coming to contribute, it also feels timely to get our house in order before that. I have also worked on improving the Javadoc for the new architecture. The goal is to publish Javadocs we can be proud of (#2311), which can serve as a complement to the high-level architectural overview in the technical reference.

Let me know if you wish for different priorities or ways of working. And get ready for more exciting user-facing changes, which should hopefully appear soon for your review.

antonin_d · April 14, 2023, 2:00pm

Quick update on this. I have updated the tasks plan above to mark the tasks which are complete - meaning that there is a first implementation of them, not that it is final in any way.

More recent progress updates can be found in this thread: Partial results of long-running operations

This month I am not working on new features but rather stabilizing the code base and hunting for performance issues. More about that soon!

antonin_d · April 27, 2023, 11:08am

In April I have been working on stabilizing the code base and optimizing the loading of large files.
Specifically:

Records mode

I have reduced the need for counting the number of records in the project, by changing the criterion under which we turn on the records mode automatically. The new criterion should be more faithful to the project structure, meaning that there is less risk of turning on the records mode on a project which is not meant to use this mode (#5661).

In addition, I have made it easier to compute the number of records at import time, during the initial pass over the dataset (which is required for other reasons, such as counting the number of rows and ensuring all rows have the same number of columns).

Overall, this means fewer passes over the entire dataset.
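As an illustration of gathering those figures in a single pass, here is a minimal sketch. It is not OpenRefine's code: the `ImportStats` class and its `scan` method are hypothetical, rows are modelled as plain lists of strings, and it assumes the simplified rule that a record starts at the first row and at every row whose first cell is non-blank.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical helper, not part of OpenRefine: gathers import statistics in one pass.
class ImportStats {
    final long rows;
    final long records;
    final int maxColumns;

    ImportStats(long rows, long records, int maxColumns) {
        this.rows = rows;
        this.records = records;
        this.maxColumns = maxColumns;
    }

    // Consumes the iterator once, counting rows and records and tracking the widest row.
    static ImportStats scan(Iterator<List<String>> rowIterator) {
        long rows = 0;
        long records = 0;
        int maxColumns = 0;
        while (rowIterator.hasNext()) {
            List<String> row = rowIterator.next();
            rows++;
            maxColumns = Math.max(maxColumns, row.size());
            // Assumption: a record starts at the first row and at every row
            // whose first (key) cell is non-blank.
            String keyCell = row.isEmpty() ? null : row.get(0);
            if (rows == 1 || !isBlank(keyCell)) {
                records++;
            }
        }
        return new ImportStats(rows, records, maxColumns);
    }

    static boolean isBlank(String cell) {
        return cell == null || cell.trim().isEmpty();
    }

    public static void main(String[] args) {
        List<List<String>> sample = Arrays.asList(
                Arrays.asList("a", "1"),
                Arrays.asList("", "2"),
                Arrays.asList("b", "3", "x"),
                Arrays.asList(null, "4"));
        ImportStats stats = scan(sample.iterator());
        System.out.println(stats.rows + " rows, " + stats.records + " records, "
                + stats.maxColumns + " columns at most");
    }
}
```

The widest row is tracked so that shorter rows can be padded afterwards; how the real importer handles this may differ.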

Better streaming from partitions

To enable parallelization of various computations (such as computing the state of facets), the grid is generally split into multiple partitions stored as distinct files on disk. The grid is read in a streaming fashion from those files.

This streaming was until now represented as a Java 8 Stream. The reason why I made this choice was that I needed an API which allowed:

constructing pipelines on a collection with lazy evaluation, meaning that the collections are only actually iterated over when necessary;
supporting a closing mechanism, to enable releasing underlying resources when a stream was no longer needed.
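A minimal sketch of those two requirements using only the standard `java.util.stream` API. The class name, the generated "rows" and the counters are made up for illustration and do not come from OpenRefine.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class LazyStreamDemo {
    // Returns the first n rows of an unbounded source. `pulled` counts how many
    // source elements were actually produced, `closed` records whether the
    // close handler ran.
    static List<String> firstRows(int n, AtomicInteger pulled, AtomicBoolean closed) {
        // Stands in for rows read from a partition file (hypothetical data).
        Stream<String> rows = Stream.iterate(0, i -> i + 1)
                .peek(i -> pulled.incrementAndGet())
                .map(i -> "row " + i)
                .onClose(() -> closed.set(true)); // where a file handle would be released
        // try-with-resources triggers the onClose handler when the block exits.
        try (Stream<String> stream = rows) {
            return stream.limit(n).collect(Collectors.toList());
        }
    }

    public static void main(String[] args) {
        AtomicInteger pulled = new AtomicInteger();
        AtomicBoolean closed = new AtomicBoolean(false);
        System.out.println(firstRows(3, pulled, closed));
        System.out.println("elements pulled: " + pulled.get() + ", closed: " + closed.get());
    }
}
```

In this simple sequential pipeline only the requested elements are pulled from the source, and the close handler runs once the `try` block exits.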

In theory, that is something Java 8 Streams provide. But in reality, the internal architecture they are built on means that:

There are cases when the stream gets buffered internally, even though it's not actually needed to perform the required computation. This is something that was hitting us: sometimes, an entire partition could get loaded in memory even though we just wanted to see the first 10 rows of the project.
There are standard stream constructs, such as zipping two streams together, which are just not available in the API and require switching back to iterators. Similarly for grouping items of a stream by batches of a given size.

So I have migrated away from such streams back to a custom interface, CloseableIterator, which is the combination of the Iterator<T> and AutoCloseable interfaces. Plenty of projects define something similar, but sadly it does not seem to be readily available in any standalone library as far as I can tell. To make this interface easier to use, I have based it on the Iterator interface offered by the Vavr library, which comes with plenty of methods to do sensible things on streams (including the ones above that are missing on streams). This library seems to be pretty well established and it feels okay to depend on it, but if we want to avoid depending on it, it would not be too hard to re-implement those utility methods on our own.
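To give an idea of what such an interface can look like, here is a hedged sketch that relies only on `java.util.Iterator`, so it is closer to the "re-implement on our own" option than to the Vavr-based interface described above. The methods `wrap`, `zipWith` and `grouped` are hypothetical names chosen for this example, not OpenRefine's or Vavr's actual signatures.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.function.BiFunction;

// Hypothetical sketch, not the actual OpenRefine interface.
interface CloseableIterator<T> extends Iterator<T>, AutoCloseable {
    @Override
    void close(); // narrowed so that callers need not handle a checked exception

    // Wraps a plain iterator together with the action that releases its resources.
    static <T> CloseableIterator<T> wrap(Iterator<T> source, Runnable onClose) {
        return new CloseableIterator<T>() {
            @Override public boolean hasNext() { return source.hasNext(); }
            @Override public T next() { return source.next(); }
            @Override public void close() { onClose.run(); }
        };
    }

    // Combines elements pairwise; closing the result closes both inputs.
    default <U, R> CloseableIterator<R> zipWith(CloseableIterator<U> other,
                                                BiFunction<T, U, R> combine) {
        CloseableIterator<T> self = this;
        return new CloseableIterator<R>() {
            @Override public boolean hasNext() { return self.hasNext() && other.hasNext(); }
            @Override public R next() { return combine.apply(self.next(), other.next()); }
            @Override public void close() {
                try { self.close(); } finally { other.close(); }
            }
        };
    }

    // Lazily groups elements into batches of at most `size` elements.
    default CloseableIterator<List<T>> grouped(int size) {
        if (size <= 0) throw new IllegalArgumentException("size must be positive");
        CloseableIterator<T> self = this;
        return new CloseableIterator<List<T>>() {
            @Override public boolean hasNext() { return self.hasNext(); }
            @Override public List<T> next() {
                if (!self.hasNext()) throw new NoSuchElementException();
                List<T> batch = new ArrayList<>();
                while (batch.size() < size && self.hasNext()) batch.add(self.next());
                return batch;
            }
            @Override public void close() { self.close(); }
        };
    }
}

class CloseableIteratorDemo {
    // Zips row numbers with values, batches them by two, and counts closed sources.
    static List<List<String>> run(int[] closedSources) {
        List<List<String>> batches = new ArrayList<>();
        try (CloseableIterator<List<String>> it = CloseableIterator
                .wrap(Arrays.asList(1, 2, 3).iterator(), () -> closedSources[0]++)
                .zipWith(CloseableIterator.wrap(Arrays.asList("a", "b", "c").iterator(),
                                () -> closedSources[0]++),
                        (n, s) -> n + ":" + s)
                .grouped(2)) {
            while (it.hasNext()) batches.add(it.next());
        }
        return batches;
    }
}
```

Because every derived iterator forwards `close()` to its sources, a single try-with-resources block around the final pipeline is enough to release all underlying resources.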

Given how widespread CloseableIterator / CloseableIterable interfaces are, I think there is a case for including this in Vavr itself - I will approach the maintainers to see if they are interested in that.

To sum up, the benefits of this move are:

Noticeable performance gain for operations which store change data (such as reconciliation or fetching URLs): before this move, using them slowed down the iteration over the grid because of the spurious buffering issue mentioned earlier;
Better closing of files when they are no longer needed;
More readable code by avoiding the writing of custom iterators in many places (thanks to the richer API provided by Vavr)

Better importing infrastructure

There are still many formats for which we are only able to import a file in OpenRefine if the original file can be loaded entirely in memory. The reason for this is that the Runner interface I designed when abstracting away from Spark in 2020 only made it possible to parse files efficiently when they were line-based. This was motivated by Spark's own API. This meant that efficient import could only happen when importing CSV/TSV files with certain importing options (or with the line-based / fixed-width importers).

I realized that we could actually make it possible to import any iterable collection of rows. This means that any file that we can stream from can be loaded without much memory. The downside compared to Spark's API is that we do not take advantage of splitting the file: at import time, we are not able to parallelize reading the dataset. But this is only true for import time: once the dataset is imported as an OpenRefine project, we can parallelize operations again. Therefore I have added methods on the Runner interface to create a Grid from a CloseableIterable of rows (so that the underlying files can be closed appropriately).
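The following sketch shows the general idea of streaming rows from a file through a closeable iterator, so that only the rows actually consumed are read and the file is released afterwards. It does not use the real Runner or Grid APIs: `TsvRowIterator` and `TsvImportDemo` are hypothetical, and the parsing is a naive split on tabs.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical row iterator over a tab-separated file; one line is parsed at a time.
class TsvRowIterator implements Iterator<List<String>>, AutoCloseable {
    private final BufferedReader reader;
    private String pending;   // next line to return, once read
    private boolean loaded;   // whether `pending` reflects the current position

    TsvRowIterator(Path file) {
        try {
            reader = Files.newBufferedReader(file, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    @Override
    public boolean hasNext() {
        if (!loaded) {
            try {
                pending = reader.readLine();
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
            loaded = true;
        }
        return pending != null;
    }

    @Override
    public List<String> next() {
        if (!hasNext()) throw new NoSuchElementException();
        loaded = false;
        return Arrays.asList(pending.split("\t", -1));
    }

    @Override
    public void close() {
        try {
            reader.close();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}

class TsvImportDemo {
    // Reads only the first `limit` rows of the file, then closes it.
    static List<List<String>> head(Path file, int limit) {
        List<List<String>> rows = new ArrayList<>();
        try (TsvRowIterator it = new TsvRowIterator(file)) {
            while (rows.size() < limit && it.hasNext()) rows.add(it.next());
        }
        return rows;
    }

    // Writes a small temporary file, previews its first two rows and deletes it.
    static List<List<String>> demo() {
        try {
            Path file = Files.createTempFile("rows", ".tsv");
            try {
                Files.write(file, Arrays.asList("a\t1", "b\t2", "c\t3"), StandardCharsets.UTF_8);
                return head(file, 2);
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The same shape would apply to any format whose parser can hand out rows one by one; the trade-off noted above remains, since a single iterator cannot be split for parallel reading.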

A side effect of that is that my lousy LineReader class which I had implemented to support splitting text files in the default runner implementation is not used anymore at all for CSV/TSV import and that's a big performance win too.

Concretely this means:

immediately enabling efficient import of all CSV/TSVs regardless of the import options
making it possible in the future to enable efficient import of Excel, JSON and XML files. For ODS files I do not think the underlying library would support it.

More optimizations and real-world testing

I have made various smaller optimizations in other places, generally motivated by manual testing on various large datasets I could find (for instance on open data portals). I have had a testing session with @thadguidry where he loaded datasets of his own in the prototype. I am reaching out to people who filed issues about large dataset handling to get their feedback on this new version.

Documentation

I have continued writing more documentation about this architecture, in three directions:

general documentation of the structure of the application. I still struggle to document parts which do not feel completely done yet, so that is holding me back a bit. For instance I think there is no point documenting the memory management strategy at the moment given that my plans about it are still evolving.
documentation about migrating extensions. I think I want it to be more example-driven than previous migration instructions, and I plan to migrate some third-party extensions myself and write down what I do as I go along.
documentation about writing new extensions. I think given the scale of the differences, it will often be easier for extension developers to refer to new, current documentation about how to write an extension from scratch, even if they want to migrate an existing one. Also, I want to bake in more information about how to set up your development environment to develop an extension, how to set up testing for it (including Cypress tests), and how to keep it compatible with newer versions of OpenRefine.

What's next

In May, I am planning to:

keep stabilizing the current prototype and aim to publish a 4.0-alpha2 release to ease external testing
keep writing documentation
improve the UX for the visualization of partial change data, because it feels sort of necessary before a new release. I have an exciting prototype to share soon - stay tuned!
take some time to have a good think about the architecture to use to expose column dependencies of operations. There are various conflicting requirements I want to ponder and experiment with.

antonin_d · May 2, 2023, 1:50pm

And here is a very quick video to demonstrate some effects of that work:

This issue summarizes which operations can be run on datasets which do not fit in RAM:

github.com/OpenRefine/OpenRefine

Eliminate places where the entire grid is loaded in memory in 4.0

opened 02:54PM - 17 Apr 23 UTC

wetneb

bug large project support

In the 4.0 architecture, we add support for working on datasets which do not fit in memory. This already works broadly speaking, but there are still use cases where the entire grid will be loaded in memory. Some of those can be avoided: if the effort to do so is reasonable, it is worth doing it. Some others cannot be avoided for various reasons. In that case we should make sure the corresponding operation is refused to the user ahead of time (so that we have proper error reporting to the user instead of dying with an `OutOfMemoryException`) or at least there is proper warning that the user could run into such an error. This is a tracker issue to identify all of the places where such work is still required.

## Operations
* [ ] transpose operations. Some of those will unavoidably load the whole dataset in memory (we need a warning for that), but the key-value columnize operation should be able to avoid this.
* [ ] clustering (although that's technically not an operation itself). There should be an option (similar to the one for facets) to only process a sample of the rows by default on large grids.
* [x] all other operations (I think!)

## Browsing
* [x] facet computation. Facets are only computed on a subset of the grid when the grid becomes large, so that the facet statistics are returned quickly. The user can configure how many rows/records are processed.
* [ ] records browsing. This is already working fine on large grids, except if the grid contains an unreasonably large record (for instance spanning the entire grid). The size of records should be bounded by default (with the option to disable this bound) so that people do not accidentally crash OpenRefine by turning on the records mode on such grids.
* [ ] sorting a grid. This loads the entire grid in memory systematically.

## Importers:
* [x] CSV and other line-based importers do not load everything in memory
* [ ] ODS can likely not avoid loading the entire file, but that's likely okay because those files have a limited length anyway
* [ ] XLSX should take advantage of Apache POI's streaming support to avoid loading everything
* [ ] JSON/XML importers should not need to load everything in memory
* [ ] RDF importer should not need to load everything in memory
* [ ] SQL importer should not have to load everything in memory
* [ ] Wikibase edits preview and quality assurance should work on a sample if the dataset is too large

## Exporters
* [ ] XLSX: we could take advantage of POI's streaming support again
* [ ] ODS: likely impossible
* [ ] Templating: check that we are already doing it in a streaming fashion
* [ ] CSV/TSV: check that we are already doing it in a streaming fashion

I am likely forgetting a lot of places which will show up during testing.

If you want to try it out, we have some snapshots available for that:

or by running it manually from the 4.0 branch.

antonin_d · March 6, 2025, 9:38am

I've updated the diagram above to reflect the current state of the implementation.
The green ticks are for features that are implemented in the master branch (or soon to be merged).

The white ticks are for things that I implemented in the delta branch, which require the scaling improvements (which I don't know how to introduce without a breaking change). Given that there is no consensus on such a breaking change, I don't have a plan to get those features released.

One exception is #5539, which I think would be still worth backporting (without deleting the Command classes for now, just deprecating them, to preserve compatibility).
