This month, I intend to focus on memory management for the 4.0 architecture.
This new architecture introduces the possibility of running OpenRefine workflows directly off disk, meaning that we no longer need to hold the entire grid in memory. That is useful for large datasets, but when working with a small dataset there is no point in working off disk: it is much faster (and perfectly affordable) to load the whole dataset in memory. So far, the new architecture never attempts to load datasets in memory and always works off disk, so there is a very noticeable slowdown compared to 3.x.
The main question I want to address is: how do we determine whether we can afford to load a dataset in memory? Attempting to load it and catching any OutOfMemoryError that arises is clearly not an option: by the time that error is thrown, the whole JVM is likely already in a bad state. What I would like to implement instead is a simple heuristic that estimates the size a grid will take in memory once loaded. I want the heuristic to be simple because the tool should be able to decide efficiently whether the grid should be loaded, and it is okay for the heuristic to be quite inaccurate.
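Given such an estimate, the "can we afford it" check could be as simple as comparing it against the heap space the JVM can still claim. A minimal sketch, assuming a hypothetical `canAffordToLoad` helper and a made-up headroom factor (neither is OpenRefine's actual API):

```java
// Sketch: decide whether a grid fits in memory, given an estimate of its
// in-memory size. The class name, method name, and headroom factor are
// all assumptions for illustration.
public class MemoryBudget {
    // Only load if the estimate fits within a fraction of the heap still
    // available, to leave headroom for the rest of the application.
    private static final double HEADROOM_FACTOR = 0.5;

    public static boolean canAffordToLoad(long estimatedGridBytes) {
        Runtime rt = Runtime.getRuntime();
        // Heap that is free right now, plus heap the JVM may still grow
        // into before hitting -Xmx.
        long available = rt.freeMemory() + (rt.maxMemory() - rt.totalMemory());
        return estimatedGridBytes < available * HEADROOM_FACTOR;
    }
}
```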
At first, it could be as simple as the number of rows multiplied by some factor. The number of columns should obviously have a big influence, and then one could consider adding the number of reconciled columns (as cells with reconciliation data will cost more), the proportion of null cells in the grid, and probably other similar features.
My plan so far is to run various experiments on a collection of real-world datasets in actual usage conditions, to gather a few data points. I am thinking of measuring memory usage with ehcache/sizeof, but other methods could also be considered (taking a heap dump, or measuring memory usage before and after loading with explicit GC calls…). I would then try to fit a simple linear model that predicts memory usage from features that are cheap to compute.
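The before/after measurement variant could be sketched as below. It is admittedly crude, since `System.gc()` is only a hint to the JVM, which is why a dedicated tool like ehcache/sizeof or a heap dump is likely more reliable; the grid built here is just a stand-in for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: estimate the retained size of an object graph by measuring used
// heap before and after building it, with explicit GC calls in between.
// System.gc() is only a request, so the result is inherently noisy.
public class RoughSizeMeasurement {
    private static long usedHeap() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    private static void settleHeap() {
        // Ask for GC a few times and give the collector a moment to run.
        for (int i = 0; i < 3; i++) {
            System.gc();
            try {
                Thread.sleep(50);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }

    public static long measure(int rows) {
        settleHeap();
        long before = usedHeap();
        // Build a stand-in for a loaded grid: a list of string rows.
        List<String[]> grid = new ArrayList<>(rows);
        for (int i = 0; i < rows; i++) {
            grid.add(new String[] {"cell " + i, "other " + i});
        }
        settleHeap();
        long after = usedHeap();
        // Touch the grid after the second measurement so it cannot be
        // collected before "after" is taken.
        if (grid.size() != rows) {
            throw new IllegalStateException("grid was not fully built");
        }
        return Math.max(0, after - before);
    }
}
```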
Does that sound sensible? Not having a ton of experience with memory optimization in Java, I would be grateful for any pointers.