Here is an update on this.
This month I worked on several performance optimizations, which make things much faster when project data is read from disk. In short, those optimizations are:
- Getting rid of my home-grown line reader class, which was performing much worse than the standard LineNumberReader. I will detail in another post why I had introduced this alternative in the first place and why it’s no longer necessary;
- Migrating away from Java 8 Streams, which I was using to represent streams of rows read from a partition. It turns out that, in some circumstances, those streams buffer the entire collection internally for architectural reasons, even though that is not needed to execute the desired pipeline. This had a big impact, and migrating to iterators got rid of the buffering issue;
- Avoiding re-counting of records as much as possible, by counting them at import time while we are doing a pass on the grid to compute other things and preserving that count after changes, when possible.
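To make the last two points more concrete, here is a minimal sketch of the general idea, not the actual project code: rows are pulled lazily through a plain `Iterator` backed by a `LineNumberReader`, and the row count is collected during a pass that is being made anyway. The class name `RowIteratorSketch` and the `scan` helper are hypothetical, made up for this illustration.

```java
import java.io.IOException;
import java.io.LineNumberReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.Iterator;
import java.util.NoSuchElementException;

/** Hypothetical sketch: iterate lazily over the lines of a partition, one row at a time. */
final class RowIteratorSketch implements Iterator<String> {
    private final LineNumberReader reader;
    private String lookahead;      // next line to return, or null once the input is exhausted
    private long rowsReturned = 0; // rows handed out so far

    /** The caller remains responsible for closing the source reader. */
    RowIteratorSketch(Reader source) {
        this.reader = new LineNumberReader(source);
        advance();
    }

    // Reads a single line ahead; only that one line is held in memory.
    private void advance() {
        try {
            lookahead = reader.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    @Override
    public boolean hasNext() {
        return lookahead != null;
    }

    @Override
    public String next() {
        if (lookahead == null) {
            throw new NoSuchElementException();
        }
        String current = lookahead;
        rowsReturned++;
        advance();
        return current;
    }

    long rowsReturned() {
        return rowsReturned;
    }

    /**
     * Single pass that computes something else (here, the longest line)
     * and gets the row count as a by-product, so no second pass is needed.
     * Returns {rowCount, maxLineLength}.
     */
    static long[] scan(Reader source) {
        RowIteratorSketch rows = new RowIteratorSketch(source);
        long maxLength = 0;
        while (rows.hasNext()) {
            maxLength = Math.max(maxLength, rows.next().length());
        }
        return new long[] { rows.rowsReturned(), maxLength };
    }
}
```

The count returned by `scan` could then be stored alongside the grid and reused for as long as later changes are known not to alter the number of rows.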
Those optimizations have little to do with memory management, but I am bringing them up here because reading project data from disk is now much faster, which reduces the need for in-memory caching.
I am now thinking that not caching things in memory at all might actually be an option. In normal use we’ll be reading the project data files over and over, so the operating system’s own mechanisms for caching frequently accessed files in memory might be enough to keep things fast. If that turns out to be true, it would be a big win: the memory occupied by those cached files (or parts of files) is not managed by the JVM, so people would not really need to adjust the JVM memory settings anymore. We would also no longer need flaky heuristics to estimate how much memory is needed to load the project grid in memory.
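As a rough way to check this on a given machine, one can stream the same file several times without retaining anything on the JVM heap and compare how long each pass takes. The sketch below is only an illustration under that assumption; the class and method names are hypothetical, and the timings depend on the machine, the file size and whether the file was already in the operating system's cache before the first pass.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical sketch: time repeated streaming reads of one file. */
final class RepeatedReadTimingSketch {

    /** Streams the file once and returns its number of lines; no line is kept on the heap. */
    static long countLines(Path file) throws IOException {
        long count = 0;
        try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            while (reader.readLine() != null) {
                count++;
            }
        }
        return count;
    }

    /** Reads the file the given number of times and returns the elapsed nanoseconds of each pass. */
    static long[] timePasses(Path file, int passes) throws IOException {
        long[] elapsedNanos = new long[passes];
        for (int i = 0; i < passes; i++) {
            long start = System.nanoTime();
            countLines(file);
            elapsedNanos[i] = System.nanoTime() - start;
        }
        return elapsedNanos;
    }
}
```

If later passes are consistently faster than the first one while the JVM heap stays small, that suggests the operating system's file cache is doing the work. It is not a substitute for proper benchmarking on real projects.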
More testing and benchmarking is needed to tell. I have started asking for testing feedback from folks who had shown an interest in large dataset handling on GitHub, and I have had a session with @thadguidry to try out the current snapshot on large datasets. So far so good.
Either way, some on-disk caching will still be required in certain situations, meaning that some states of the project need to be stored explicitly, for instance after reordering rows.