Enhancing OpenRefine’s Undo/Redo Functionality for Large Datasets

Hello

One of the key features of OpenRefine is its undo/redo functionality, which allows users to revert changes made to a dataset. :innocent: However, when working with extremely large datasets, users often experience slow performance, high memory usage, or even crashes when trying to undo multiple operations. This can make it difficult to experiment with transformations or to backtrack efficiently in complex data-cleaning workflows. :slightly_smiling_face:

A potential improvement could be a lightweight undo log, where only transformation steps (rather than full dataset snapshots) are stored, reducing memory overhead. Another idea is checkpoint-based undo, where users can create manual restore points at key stages of their work, making it easier to revert without reloading large datasets. Additionally, better visual feedback on the memory usage and performance impact of undoing operations could help users manage resources more effectively. :thinking: I checked this Library Carpentry course related to the topic (https://librarycarpentry.github.io/lc-open-refine/09-undo-and-redo.html) and found it quite informative.
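To make the undo-log idea a bit more concrete, here is a minimal sketch (all class names are hypothetical, and this is not OpenRefine's actual code) of a history that records only the transformation steps plus occasional manual checkpoints, so undoing replays steps from the last checkpoint instead of keeping reverse data for every change:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.List;
import java.util.function.UnaryOperator;

// Hypothetical sketch: the "grid" is just a list of rows (strings, for brevity).
class UndoLog {
    // Each step stores only the transformation itself, not a copy of the grid.
    private final Deque<UnaryOperator<List<String>>> steps = new ArrayDeque<>();
    // A checkpoint is a full copy of the grid, created only on demand.
    private List<String> checkpoint;
    private List<String> grid;

    UndoLog(List<String> importedGrid) {
        this.grid = new ArrayList<>(importedGrid);
        this.checkpoint = new ArrayList<>(importedGrid); // the import is the first restore point
    }

    void apply(UnaryOperator<List<String>> step) {
        grid = step.apply(grid);
        steps.push(step);
    }

    // Manual restore point: copy the grid once and drop the step log.
    void createCheckpoint() {
        checkpoint = new ArrayList<>(grid);
        steps.clear();
    }

    // Undo by replaying the remaining steps on top of the latest checkpoint,
    // instead of storing per-step reverse data for the whole dataset.
    void undo() {
        if (steps.isEmpty()) {
            return;
        }
        steps.pop();
        List<UnaryOperator<List<String>>> ordered = new ArrayList<>(steps);
        Collections.reverse(ordered); // the deque iterates newest-first
        List<String> replayed = new ArrayList<>(checkpoint);
        for (UnaryOperator<List<String>> s : ordered) {
            replayed = s.apply(replayed);
        }
        grid = replayed;
    }

    List<String> currentGrid() {
        return grid;
    }
}
```

The trade-off is that undo costs CPU time (replaying steps) instead of memory (storing reverse data), and checkpoints bound how long that replay can get.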

Has anyone else faced performance issues with undo/redo in large projects? Are there existing workarounds or optimizations that could improve reliability? :expressionless:

Let’s discuss potential design improvements to make OpenRefine more efficient for handling massive datasets.

Thank you!! :slightly_smiling_face:


Hi @kapet,

Welcome to the forum!
Yes, OpenRefine as it stands isn't able to handle large datasets very well. I have been working on improving this by removing the assumption that projects fit in RAM. The approach is pretty much what you are proposing: instead of storing, for each change, the data required to apply or undo it, we switch to storing full copies of the grid at certain points in the project's history (which always include the original state of the project, just after importing the initial dataset). The operations are then lazily computed on top of those snapshots.
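For illustration only (the class names below are made up and the real implementation is more involved), the idea can be sketched like this: the history materializes a snapshot of the grid at a few positions, starting with the imported dataset, and every other state is derived lazily by replaying the intervening operations on top of the nearest earlier snapshot:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.UnaryOperator;

// Hypothetical sketch of a snapshot-based history; a "grid" is just a list of rows.
class SnapshotHistory {
    private final List<UnaryOperator<List<String>>> operations = new ArrayList<>();
    // Snapshots keyed by history position; position 0 is the freshly imported dataset.
    private final Map<Integer, List<String>> snapshots = new HashMap<>();

    SnapshotHistory(List<String> importedGrid) {
        snapshots.put(0, new ArrayList<>(importedGrid));
    }

    // Recording an operation is cheap: nothing is computed eagerly.
    void addOperation(UnaryOperator<List<String>> op) {
        operations.add(op);
    }

    // The project can decide (e.g. every N operations) to materialize a snapshot here.
    void snapshotAt(int position) {
        snapshots.put(position, getGridAt(position));
    }

    // Undo/redo becomes "give me the grid at position p": find the nearest
    // snapshot at or before p and lazily replay the remaining operations.
    List<String> getGridAt(int position) {
        int base = snapshots.keySet().stream()
                .filter(k -> k <= position)
                .max(Integer::compare)
                .orElse(0);
        List<String> grid = new ArrayList<>(snapshots.get(base));
        for (int i = base; i < position; i++) {
            grid = operations.get(i).apply(grid);
        }
        return grid;
    }
}
```

With this scheme, undoing doesn't "reverse" anything: an earlier state is either a snapshot or is recomputed from one, which is what makes it workable for grids that don't fit in RAM (where the snapshots would live on disk rather than in memory, unlike in this toy version).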

I have implemented it (see this video for instance), but there is currently no prospect of those improvements being released. Implementing this requires large changes to OpenRefine's backend, which will break compatibility with most extensions. There is currently no consensus in the team on whether and how to make such breaking changes.

Other breaking changes have also been considered, such as upgrading Jetty (the web server used in OpenRefine), and as far as I can tell there is no consensus on whether and how to make those either.

Wait a sec...

4.0, large datasets, etc.

There is consensus to move forward and roll those changes into 4.0 (breaking extension compatibility). Only one person, Tom, was against it, and in my opinion his involvement is too occasional, and his merit too limited, to hold this back now.
I thought we had agreed that we could all help extension authors who want to migrate their extensions to 4.0+ by writing up a general guide (some of which you have actually already done, with other parts still to be written down).

Jetty upgrades

And couldn't the Jetty changes be done by anyone (not just Tom)? We should make those changes part of 4.0 (or 4.1), whichever needs to come first, in order to deliver the large dataset support that our users have asked for in past surveys. His branch with the Jetty 11-12 work is available to anyone, no? And the commit history only has one or two commits in each branch, so it should be easy to pick up from there, or to just start from scratch by forking our 4.0 branch and working on the Jetty 11 migration first? GitHub - tfmorris/OpenRefine at jetty11 and GitHub - tfmorris/OpenRefine at jetty12

@antonin_d Can you tell which changes need to come in which order?
Would it be best for your 4.0 changes to land first, then Jetty 11, then Jetty 12, or is there a better order?

Meritocracy IRL

We definitely need to be careful when talking about who is included in, or excluded from, the core team. In my opinion, merit counts first and foremost, and that is something you have continuously shown and proven. Being away from the project for months or even years at a time, as Tom has been on occasion, shouldn't stop the progress of OpenRefine for other contributors and our users; @Martin (our current Project Director) and I have stressed this before.

Those two breaking changes are indeed completely independent; you could do them in either order.
But of course there are many more breaking changes that would (I think) be sensible to schedule, such as improving the isolation of extensions, to make it less likely that they break when upgrading OpenRefine. That generally means cutting off access to various functionalities (visibility of Java dependencies, JavaScript entry points…), so it generally means breaking things.
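As a rough illustration of what "cutting off access" could mean on the Java side (this is purely a sketch of the general classloader-isolation technique, not how OpenRefine loads extensions today), an extension jar could be loaded with the platform class loader as parent, so that the JDK is visible to it but the host application's internal dependencies are not:

```java
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.file.Path;

class IsolatedExtensionLoader {
    // Load an extension jar so that it only sees the JDK and its own classes,
    // not the classes and libraries on the host application's classpath.
    static ClassLoader loadExtension(Path extensionJar) throws Exception {
        URL[] urls = { extensionJar.toUri().toURL() };
        // Parent is the platform class loader (Java 9+): JDK classes resolve,
        // but the host's internal Java dependencies are simply not visible.
        return new URLClassLoader(urls, ClassLoader.getPlatformClassLoader());
    }
}
```

The host would then expose whatever it does want extensions to use through a small, explicit API, which is exactly the kind of change that breaks existing extensions.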

This was discussed at length in this thread:

My default reaction to the lack of consensus is to avoid making any breaking changes, and to spend the remaining time I have on this project working only on things that can be released seamlessly. It does feel like giving up on finding a consensus, which is sad of course, but I fear we simply don't have the team cohesion and quality of communication required to reach one. And there's a fair case for saying that OpenRefine simply is legacy software that we want to maintain as is. Major improvements are perhaps a better fit for a ground-up rewrite, so for a different project.
