Memory management in the 4.0 architecture

This month, I intend to focus on memory management for the 4.0 architecture.

This new architecture introduces the possibility of running OpenRefine workflows directly off disk, meaning that we are no longer required to store the entire grid in memory. That is useful for large datasets, but when working with a small dataset there is no point working off disk: it is much faster (and perfectly affordable) to load the whole dataset in memory. So far, the new architecture never attempts to load datasets in memory and always works off disk, which causes a very noticeable slowdown compared to 3.x.

The main question I want to address is: how do we determine whether we can afford to load a dataset in memory? Attempting to load it and catching any OutOfMemoryError that arises is clearly not an option. What I would like to implement instead is a simple heuristic that estimates the size a grid will take in memory once loaded. I want the heuristic to be simple because the tool should be able to decide efficiently whether the grid should be loaded, and it is okay for the heuristic to be quite inaccurate.

At first, it could be as simple as the number of rows multiplied by some factor. The number of columns should obviously have a big influence too, and one could then consider adding the number of reconciled columns (since cells with reconciliation data cost more), the proportion of null cells in the grid, and probably other similar features.
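
To make this concrete, here is a minimal sketch of what such a heuristic could look like. All the coefficients below are hypothetical placeholders; the real values would come out of the fitting experiments described below.

```java
// A hypothetical linear heuristic: all coefficients are placeholders
// to be fitted on measured data points, not real measurements.
public class GridMemoryEstimator {

    private static final long BYTES_PER_ROW = 50;
    private static final long BYTES_PER_CELL = 100;
    private static final long EXTRA_BYTES_PER_RECONCILED_CELL = 500;

    /**
     * Estimates the in-memory footprint of a grid from features that are
     * cheap to compute without materializing the whole grid.
     */
    public static long estimateBytes(long rowCount, int columnCount,
            int reconciledColumnCount, double nullCellProportion) {
        long cells = rowCount * columnCount;
        long nonNullCells = Math.round(cells * (1.0 - nullCellProportion));
        long reconciledCells = rowCount * reconciledColumnCount;
        return rowCount * BYTES_PER_ROW
                + nonNullCells * BYTES_PER_CELL
                + reconciledCells * EXTRA_BYTES_PER_RECONCILED_CELL;
    }

    /** Decides whether loading the grid in memory looks affordable. */
    public static boolean canLoadInMemory(long estimatedBytes) {
        Runtime rt = Runtime.getRuntime();
        long used = rt.totalMemory() - rt.freeMemory();
        long available = rt.maxMemory() - used;
        // Keep a generous safety margin, since the estimate is inaccurate.
        return estimatedBytes < available / 2;
    }
}
```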

My plan so far is to run various experiments on a collection of real-world datasets, in actual usage conditions, to gather a few data points. I am thinking of measuring the memory usage with ehcache/sizeof, but other methods could also be considered (a heap dump, or measuring the memory usage before and after loading with explicit GC calls…). I would then try to fit a simple linear model that predicts the memory usage from easy-to-compute features.
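
For reference, here is a sketch of two of those measurement approaches, assuming the grid is reachable from a single object. The first uses the ehcache/sizeof library (artifact org.ehcache:sizeof); the second is the cruder before/after heap measurement, bearing in mind that System.gc() is only a hint to the JVM, so the numbers are noisy.

```java
import org.ehcache.sizeof.SizeOf;

public class MemoryMeasurement {

    /** Deep size of the object graph rooted at the grid, via ehcache/sizeof. */
    public static long deepSize(Object grid) {
        SizeOf sizeOf = SizeOf.newInstance();
        return sizeOf.deepSizeOf(grid);
    }

    /**
     * Rough heap delta around a loading action, with explicit GC calls.
     * The Runnable must keep the loaded grid reachable, otherwise it may
     * be collected before the second measurement.
     */
    public static long heapDelta(Runnable loadGrid) {
        Runtime rt = Runtime.getRuntime();
        System.gc(); // only a hint: the JVM may not collect everything
        long before = rt.totalMemory() - rt.freeMemory();
        loadGrid.run();
        System.gc();
        long after = rt.totalMemory() - rt.freeMemory();
        return after - before;
    }
}
```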

Does that sound sensible? Not having a ton of experience with memory optimization in Java, I am grateful for any pointers.

Why not just leverage a library to handle this?

Apache Ignite, with its persistence and caching abilities, could be worth a look: https://www.baeldung.com/apache-ignite. And you don’t have to use SQL with it if you only want to work with Java objects.

Also, you might look at how https://github.com/redisson/redisson handles things. (Two things to note about Redisson for us might be the Jackson JSON codec support, as well as Caffeine.)

There’s also https://infinispan.org/features

Also, there are lots of helpful folks if you ask on the Apache Commons mailing list: dev-subscribe@commons.apache.org

Otherwise @tfmorris would be the go-to guy, since I can only help later with performance testing through JRebel, etc., and with reviewing the design.

(Maybe look at the sources of the projects above as well.)

I am not really sure how you would delegate this memory management problem to an external library, apart from running entire workflows in such an external system, which amounts to writing a new datamodel runner implementation with that system (just like we do for Spark).

The libraries/platforms you mention all seem to be designed for distributed contexts. One might be able to use them to write another datamodel runner implementation, but like Spark, they are unlikely to be usable for local usage (OpenRefine’s current intended use, where the user just runs OpenRefine on a single, potentially modest machine). My current work is focused on making this local usage as fast as possible.

I have written a small prototype to evaluate the memory usage in this context, and it seems to work well enough for now. My next step will be to integrate the resulting heuristic into OpenRefine and see whether we get an acceptable UX.

Here is an update on this.
This month I have worked on various performance optimizations, which have resulted in much better speeds when working off disk. In short, those optimizations are:

  • Getting rid of my home-grown line reader class, which performed much worse than the standard LineNumberReader. I will detail in another post why I had introduced this alternative in the first place and why it is no longer necessary;
  • Migrating away from Java 8 Streams, which I was using to represent streams of rows read from a partition. It turns out that those streams can, in some circumstances, buffer up the entire collection internally for architectural reasons, even though that buffering is not needed to execute the desired pipeline. This had a big impact, and migrating to iterators got rid of the issue (see the sketch after this list);
  • Avoiding re-counting records as much as possible, by counting them at import time, while we are already doing a pass over the grid to compute other things, and preserving that count across changes when possible.
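
To illustrate the stream-buffering pitfall mentioned in the second bullet point, here is a minimal sketch with hypothetical Partition and Row types. It is not the actual OpenRefine code, just a reproduction of the pattern: on Java 8, Stream.flatMap does not short-circuit (a known JDK issue, JDK-8075939, fixed in later releases), so even a limited pipeline can drain and buffer an entire partition, whereas a hand-written iterator pulls rows one at a time.

```java
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.stream.Stream;

public class RowScanning {

    interface Row {}

    interface Partition {
        Stream<Row> rowStream();     // stream-based access
        Iterator<Row> rowIterator(); // iterator-based access
    }

    // Stream version: on Java 8, flatMap drains each inner stream entirely,
    // so a large partition can get buffered even though we only want n rows.
    static Stream<Row> firstRowsStream(List<Partition> partitions, long n) {
        return partitions.stream()
                .flatMap(Partition::rowStream)
                .limit(n);
    }

    // Iterator version: rows are pulled lazily, one at a time.
    static Iterator<Row> firstRowsIterator(List<Partition> partitions, long n) {
        Iterator<Partition> parts = partitions.iterator();
        return new Iterator<Row>() {
            private Iterator<Row> current = Collections.emptyIterator();
            private long remaining = n;

            @Override
            public boolean hasNext() {
                while (remaining > 0 && !current.hasNext() && parts.hasNext()) {
                    current = parts.next().rowIterator();
                }
                return remaining > 0 && current.hasNext();
            }

            @Override
            public Row next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                remaining--;
                return current.next();
            }
        };
    }
}
```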

Those optimizations have little to do with memory management, but I bring them up here because reading project data from disk is much faster now, so the need for in-memory caching has been reduced.

I am now thinking that it might actually be an option not to cache anything in memory at all. Because in normal use we will be reading the project data files over and over, the operating system’s own mechanism for caching frequently accessed files in memory might be enough to keep things fast. If that turns out to be true, it would be a big win: the memory occupied by those cached files (or parts of files) is not managed by the JVM, so people would no longer really need to adjust the JVM memory settings. We would also not need flaky heuristics to estimate how much memory is required to load the project grid in memory.
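
One cheap way to get a feel for this effect is to read the same file twice and time both passes: on most systems, the second pass is served largely from the page cache. A rough, self-contained probe (the file path is just a placeholder, and the first read may already be warm if the file was accessed recently):

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class PageCacheProbe {

    /** Reads the whole file and returns the number of bytes read. */
    static long readAll(Path path) throws IOException {
        byte[] buffer = new byte[1 << 16];
        long total = 0;
        try (InputStream in = Files.newInputStream(path)) {
            int read;
            while ((read = in.read(buffer)) != -1) {
                total += read;
            }
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        Path path = Paths.get(args.length > 0 ? args[0] : "project-data.bin");
        long t0 = System.nanoTime();
        readAll(path);
        long t1 = System.nanoTime();
        readAll(path); // usually much faster: served from the page cache
        long t2 = System.nanoTime();
        System.out.printf("cold read: %d ms, warm read: %d ms%n",
                (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000);
    }
}
```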

More testing and benchmarking are needed to tell. I have started asking for testing feedback from folks who had shown an interest in large dataset handling on GitHub, and I have had a session with @thadguidry to try out the current snapshot on large datasets. So far so good.

In any case, on-disk caching will still be required in some cases, meaning that some states of the project need to be stored explicitly, for instance after reordering rows.