Memory management in the 4.0 architecture

This month, I intend to focus on memory management for the 4.0 architecture.

This new architecture introduces the possibility of running OpenRefine workflows directly off disk, meaning that we are no longer required to store the entire grid in memory. That is useful for large datasets, but when working with a small dataset there is no point in working off the disk: it is much faster (and perfectly affordable) to load the whole dataset in memory. So far, the new architecture never attempts to load datasets in memory and always works off the disk, so there is a very noticeable slowdown compared to 3.x.

The main question I want to address is: how do we determine whether we can afford to load a dataset in memory? Attempting to load it and catching any OutOfMemoryError that arises is clearly not an option. What I would like to implement is a simple heuristic that estimates the size a grid will take in memory once loaded. I want the heuristic to be simple because the tool should be able to decide quickly whether the grid should be loaded, and it is okay for the heuristic to be quite inaccurate.
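To make the intent concrete, here is a minimal sketch of the decision gate I have in mind, comparing a (still hypothetical) size estimate against the heap headroom reported by the JVM. The helper name, the safety factor and the policy itself are assumptions, not existing OpenRefine code:

```java
// Sketch only: decide whether an estimated grid size fits comfortably in the heap.
public class GridLoadingPolicy {

    // Arbitrary safety margin: only load if the estimate uses at most half the free heap.
    private static final double SAFETY_FACTOR = 0.5;

    public static boolean canLoadInMemory(long estimatedGridBytes) {
        Runtime runtime = Runtime.getRuntime();
        long usedBytes = runtime.totalMemory() - runtime.freeMemory();
        long availableBytes = runtime.maxMemory() - usedBytes;
        return estimatedGridBytes < SAFETY_FACTOR * availableBytes;
    }
}
```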

At first, it could be as simple as the number of rows multiplied by some factor. The number of columns should obviously have a big influence, and then one could consider adding the number of reconciled columns (as cells with reconciliation data cost more), the proportion of null cells in the grid, and probably other similar features.
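For illustration, such a linear heuristic over cheap-to-compute features could look like the sketch below. The feature set and all coefficients are placeholders I am making up here; they would have to come out of the fitting experiments described next:

```java
// Hypothetical linear heuristic over simple grid features.
// All per-row / per-cell coefficients are placeholders to be fitted from measurements.
public class GridMemoryEstimator {

    private static final long BYTES_PER_ROW = 200;
    private static final long BYTES_PER_CELL = 60;
    private static final long BYTES_PER_NULL_CELL = 8;
    private static final long EXTRA_BYTES_PER_RECONCILED_CELL = 300;

    public static long estimateBytes(long rowCount, int columnCount,
            int reconciledColumnCount, double nullCellProportion) {
        long cellCount = rowCount * columnCount;
        long nullCells = (long) (cellCount * nullCellProportion);
        long nonNullCells = cellCount - nullCells;
        return rowCount * BYTES_PER_ROW
                + nonNullCells * BYTES_PER_CELL
                + nullCells * BYTES_PER_NULL_CELL
                + rowCount * reconciledColumnCount * EXTRA_BYTES_PER_RECONCILED_CELL;
    }
}
```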

My plan so far is to run various experiments on a collection of real-world datasets, in actual usage conditions, to gather a few data points. I am thinking of measuring the memory usage with ehcache/sizeof, but other methods could also be considered (a heap dump, measuring the memory usage before and after loading with explicit GC calls…). I would then try to fit some simple linear model that predicts the memory usage from features that are cheap to compute.
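A sketch of the two measurement approaches mentioned above, assuming the org.ehcache:sizeof library for the first one; the `loadGrid` callback is just a placeholder for whatever loads the grid and keeps a strong reference to it:

```java
import org.ehcache.sizeof.SizeOf;

public class MemoryMeasurement {

    // Approach 1: deep size of the loaded grid's object graph via ehcache/sizeof.
    public static long deepSize(Object grid) {
        SizeOf sizeOf = SizeOf.newInstance();
        return sizeOf.deepSizeOf(grid);
    }

    // Approach 2: rough heap delta around the loading step, with explicit GC calls.
    // System.gc() is only a hint, so this is noisy and best used as a cross-check.
    // The Runnable must keep the loaded grid reachable, or the GC may reclaim it.
    public static long heapDelta(Runnable loadGrid) {
        Runtime runtime = Runtime.getRuntime();
        System.gc();
        long before = runtime.totalMemory() - runtime.freeMemory();
        loadGrid.run();
        System.gc();
        long after = runtime.totalMemory() - runtime.freeMemory();
        return after - before;
    }
}
```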

Does that sound sensible? Not having a ton of experience with memory optimization in Java, I am grateful for any pointers.

Why not just leverage a library to handle this?

Apache Ignite, with its persistence and caching abilities, could be an option: https://www.baeldung.com/apache-ignite. And you don’t have to use SQL with it if you only want to work with Java objects.

Also, you might look at how https://github.com/redisson/redisson handles things. (Two things to note about Redisson for us might be its Jackson JSON codec support as well as its use of Caffeine.)

There’s also https://infinispan.org/features

There are also lots of helpful folks if you ask on the Apache Commons mailing list: dev-subscribe@commons.apache.org

Otherwise @tfmorris would be the go-to guy, since I can only help later with performance testing through JRebel, etc., and with reviewing the design.

(Maybe also look at the sources of the projects above.)

I am not really sure how you would delegate this memory management problem to an external library, apart from running entire workflows in such an external system, which amounts to writing a new datamodel runner implementation for that system (just like we do for Spark).

The libraries/platforms you mention all seem to be designed for distributed contexts. One might be able to write another datamodel runner implementation with them, but like Spark, they are unlikely to be usable for local usage (OpenRefine’s current intended use case, where the user just runs OpenRefine on a single, potentially modest machine). My current work is focused on making this local usage as fast as possible.

I have written a small prototype to evaluate the memory usage in this context, and it seems to work well enough for now. My next step will be to integrate the resulting heuristic into OpenRefine and see whether the resulting UX is acceptable.