The problem of improving OpenRefine’s extensibility is becoming more pressing in my mind.
There are some things in OpenRefine for which I have a fairly clear mental picture of how things should be and I just lack the time to actually implement those ideas. For this problem I still have quite some open questions so let me write this down here, because I am a bit stuck.
What are the problems with OpenRefine’s current extension system?
- The user experience of installing and using extensions is not great.
- Downloading a zip file and unzipping it in a particular directory is not so easy. People are used to a full graphical experience, ideally with an online searchable catalogue listing extensions (such as extension stores for web browsers).
- When upgrading OpenRefine, if some of the installed extensions are not compatible with the new version, OpenRefine will likely fail to start, without a clear error message.
- The developer experience of creating and maintaining extensions is not great.
- The list of intended extension points is not documented. In general, official documentation is non-existent.
- The extensibility of the frontend is very brittle: we just include all Javascript code exposed by the extensions in the main JS bundle served to the user. Any error in the extensions can halt the entire execution of the app. The extensions are able to manipulate the DOM in fairly arbitrary ways.
- Our extension mechanism relies on Butterfly, an application framework which is long abandoned - as far as I am aware, OpenRefine and its derivatives are the only user of that framework.
- Extensions tend to be only compatible with fairly specific versions of OpenRefine, mostly because of a series of migrations we did between OpenRefine 2.8 and 3.3 (which were often forced on us). This is annoying both for users and developers. We have been a bit more stable recently but there can always be big looming changes we are not aware of yet.
Why those problems are becoming more pressing
- The Wikibase extension has been growing in size and deserves to be in its own repository in my opinion. By moving an extension that is currently shipped with the main software to an external repository, we will ask people to install the extension manually, which is a new incentive to have a good user experience for that.
- New extensions developed were developed by the OpenRefine team (such as the CommonsExtension), outside of the main repository, so our awareness of the pain points above has grown.
- The new architecture introduced in the 4.0 branch is a massive compatibility break, so the detection of extension compatibility will be all the more important for this release. And because any changes to the extension mechanism of OpenRefine will most likely be breaking changes, perhaps this is a good opportunity to introduce them.
Possible solutions
Since our extension mechanism is determined by our use of Butterfly, the main question is what to do with Butterfly.
- We could decide to migrate to something else that provides a better experience. This is a question that has been discussed many times (such as here). The problem is, I am not aware of any framework that would really be comparable. This is because applications like OpenRefine (browser-based but with a local server, and extensible) are rather uncommon, so it’s not a surprise there aren’t that many frameworks to support them.
- In a sense, as the only known users of Butterfly, we are in a position to do pretty much what we want with this framework. We can change whatever aspect of it as we need. Whenever vulnerabilities are discovered (such as Log4Shell) we are able to patch it, and we control the release cycle so we can make it match ours. So we could decide to stick with it and just improve Butterfly so that it fits our bill. Which might involve rewriting large parts of it, though.
What do we actually need? From my perspective, the challenge is to find a system which lets extensions patch the backend and the frontend at the same time. We want users to be able to use our web UI to install an extension (probably packaged as some archive, but the user should not need to know that), and that extension should be able to provide new functionality both in the backend and in the frontend. This has a few implications:
- Patching the backend is not too hard: this can be done by adding
.class
or.jar
files provided by the extension to the Java classpath (or dynamically loading them via a classloader), and making sure the main application discovers the components exposed by the extension (for instance via the SPI and/or OSGi, or via explicit registration in a configuration file as is currently done). - Patching the frontend seems more difficult. The modern ways to combine JS modules together (using import statements and tools such as webpack) require external tooling that would need to be embedded in OpenRefine itself to perform those compilation steps at runtime, if we want to do better than just concatenating vanilla JS files together. Can we get away with something that does not require shipping the whole npm developer ecosystem to all our users? This blog post presents an example plug-in architecture for a single page app written with React, and there is another solution for Vue.js - it all looks pretty hacky so I wonder if there exist established versions of such things.
Where to learn about these things?
There are probably tons of relevant systems which try to solve similar problems and I am just not aware of them. Where do I find knowledgeable people to talk to? One thing I have been thinking about is to spend some time writing small extensions for major software platforms, and use this opportunity to learn about extension mechanisms used by successful software projects. I have been thinking about writing reconciliation-related extensions for mainstream spreadsheet software (LibreOffice, StarOffice, Google Sheets…) so that could be a good opportunity, but it is already fairly clear to me that the solutions adopted there will not be directly applicable to OpenRefine as long as we are using this weird “locally web-based” architecture.