Improving the UX of extension install, and Butterfly

The problem of improving OpenRefine’s extensibility is becoming more pressing in my mind.

There are some things in OpenRefine for which I have a fairly clear mental picture of how things should be, and I just lack the time to implement those ideas. For this problem, though, I still have quite a few open questions, so let me write it down here, because I am a bit stuck.

What are the problems with OpenRefine’s current extension system?

  • The user experience of installing and using extensions is not great.
    • Downloading a zip file and unzipping it in a particular directory is not so easy. People are used to a full graphical experience, ideally with an online searchable catalogue listing extensions (such as extension stores for web browsers).
    • When upgrading OpenRefine, if some of the installed extensions are not compatible with the new version, OpenRefine will likely fail to start, without a clear error message.
  • The developer experience of creating and maintaining extensions is not great.
  • Extensions tend to be compatible only with fairly specific versions of OpenRefine, mostly because of a series of migrations we did between OpenRefine 2.8 and 3.3 (which were often forced on us). This is annoying both for users and for developers. We have been a bit more stable recently, but there can always be big looming changes we are not aware of yet.
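To illustrate the second point in the first bullet: a startup compatibility check does not have to be elaborate. The sketch below is purely hypothetical (extensions do not currently declare a minimum core version, and none of these names are OpenRefine's actual API); it just shows that comparing a declared minimum version against the running core, so an incompatible extension can be skipped with a clear message instead of crashing the whole app, is a small amount of code.

```java
// Hypothetical sketch of a startup compatibility check; illustrative only.
public class VersionCheck {

    // Compares dotted numeric versions, e.g. "3.7" vs "3.7.2".
    static int compare(String a, String b) {
        String[] as = a.split("\\.");
        String[] bs = b.split("\\.");
        int n = Math.max(as.length, bs.length);
        for (int i = 0; i < n; i++) {
            int ai = i < as.length ? Integer.parseInt(as[i]) : 0;
            int bi = i < bs.length ? Integer.parseInt(bs[i]) : 0;
            if (ai != bi) return Integer.compare(ai, bi);
        }
        return 0;
    }

    // True if the running core satisfies the extension's declared minimum.
    static boolean isCompatible(String coreVersion, String minRequired) {
        return compare(coreVersion, minRequired) >= 0;
    }

    public static void main(String[] args) {
        // An extension requiring core >= 3.6 on a 3.5 install would be
        // skipped with a message, instead of failing at class-loading time.
        System.out.println(isCompatible("3.7.2", "3.6")); // true
        System.out.println(isCompatible("3.5", "3.6"));   // false
    }
}
```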

Why those problems are becoming more pressing

  • The Wikibase extension has been growing in size and deserves to be in its own repository, in my opinion. By moving an extension that is currently shipped with the main software to an external repository, we will be asking people to install the extension manually, which gives us a new incentive to provide a good user experience for that.
  • New extensions have been developed by the OpenRefine team (such as the CommonsExtension) outside of the main repository, so our awareness of the pain points above has grown.
  • The new architecture introduced in the 4.0 branch is a massive compatibility break, so the detection of extension compatibility will be all the more important for this release. And because any changes to the extension mechanism of OpenRefine will most likely be breaking changes, perhaps this is a good opportunity to introduce them.

Possible solutions

Since our extension mechanism is determined by our use of Butterfly, the main question is what to do with Butterfly.

  • We could decide to migrate to something else that provides a better experience. This is a question that has been discussed many times (such as here). The problem is, I am not aware of any framework that would really be comparable. This is because applications like OpenRefine (browser-based but with a local server, and extensible) are rather uncommon, so it’s not a surprise there aren’t that many frameworks to support them.
  • In a sense, as the only known users of Butterfly, we are in a position to do pretty much what we want with this framework. We can change whatever aspects of it we need. Whenever vulnerabilities are discovered (such as Log4Shell) we are able to patch it, and we control the release cycle so we can make it match ours. So we could decide to stick with it and just improve Butterfly so that it fits the bill. This might involve rewriting large parts of it, though.

What do we actually need? From my perspective, the challenge is to find a system which lets extensions patch the backend and the frontend at the same time. We want users to be able to use our web UI to install an extension (probably packaged as some archive, but the user should not need to know that), and that extension should be able to provide new functionality both in the backend and in the frontend. This has a few implications:

  • Patching the backend is not too hard: this can be done by adding .class or .jar files provided by the extension to the Java classpath (or dynamically loading them via a classloader), and making sure the main application discovers the components exposed by the extension (for instance via the SPI and/or OSGi, or via explicit registration in a configuration file as is currently done).
  • Patching the frontend seems more difficult. The modern ways to combine JS modules together (using import statements and tools such as webpack) require external tooling that would need to be embedded in OpenRefine itself to perform those compilation steps at runtime, if we want to do better than just concatenating vanilla JS files together. Can we get away with something that does not require shipping the whole npm developer ecosystem to all our users? This blog post presents an example plug-in architecture for a single page app written with React, and there is another solution for Vue.js - it all looks pretty hacky so I wonder if there exist established versions of such things.
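As a concrete illustration of the SPI option mentioned in the first bullet, here is a minimal, hypothetical sketch using `java.util.ServiceLoader`. `ImporterExtension` is an invented name, not an actual OpenRefine interface; an extension jar would ship an implementation together with a `META-INF/services` entry naming the implementing class, and the core would then discover it without any hard-coded registration.

```java
import java.util.ServiceLoader;

// Invented extension point for illustration; not OpenRefine's actual API.
interface ImporterExtension {
    String formatName();
}

public class SpiDiscovery {

    public static int discover() {
        int found = 0;
        // ServiceLoader scans the classpath (or a dedicated classloader
        // holding the extension jars) for registered providers.
        for (ImporterExtension ext : ServiceLoader.load(ImporterExtension.class)) {
            System.out.println("Discovered importer: " + ext.formatName());
            found++;
        }
        return found;
    }

    public static void main(String[] args) {
        // With no extension jars on the classpath, nothing is discovered.
        System.out.println(discover() + " extension(s) found");
    }
}
```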

Where to learn about these things?

There are probably tons of relevant systems which try to solve similar problems and I am just not aware of them. Where do I find knowledgeable people to talk to? One thing I have been thinking about is spending some time writing small extensions for major software platforms, and using this opportunity to learn about the extension mechanisms of successful software projects. I have been thinking about writing reconciliation-related extensions for mainstream spreadsheet software (LibreOffice, StarOffice, Google Sheets…), so that could be a good opportunity, but it is already fairly clear to me that the solutions adopted there will not be directly applicable to OpenRefine as long as we are using this weird “locally web-based” architecture.


On Mastodon, Michael Lipp shared his JGrapes Web Console project which has tackled a very similar problem. He gives some details about how he solves the problem of injecting new assets on the front-end side, at runtime. On the back-end side, he uses OSGi too.

We should look at how NextCloud does things. They let users install “apps”, which can modify both the backend (PHP) and the frontend (vue.js). On the surface, this seems like something we could take inspiration from:
https://docs.nextcloud.com/server/latest/developer_manual/app_development/index.html

In the backend, apps are able to rely on PHP libraries but they do not have any isolation in place (a given library should not have different versions required by different apps running at the same time).
In the frontend, they use Webpack to create a Javascript file per app, using a Webpack configuration which (probably) avoids the inclusion of dependencies which are already supplied by the core.

NextCloud isn’t exactly a “locally web-based” tool like OpenRefine is but they let end users install and upgrade apps so it feels like this could be a fitting architecture for OpenRefine too.


I keep thinking about this topic, as an important task that's on my back burner.

After looking at other projects (notably Nextcloud above, or this video about Pretix's plugin system more recently), I am more and more convinced that we should not look for an existing application framework that would fulfill our extensibility needs. Projects like Nextcloud or Pretix build their own plugin architecture and that's fine: their extensibility needs are different, and it's helpful to be in control of that instead of being tied to an external dependency.

Perhaps one notable exception I am aware of is Gephi, which relies on the NetBeans platform (which comes with a plugin system). I'll try to reach out to the maintainers to check how happy they are with this, and whether they have suggestions for a web-based app like OpenRefine.

But generally I think we should just embrace Butterfly and revamp it to fit our needs. The immediate tasks I see are:

  1. migration to a recent version of Jetty
  2. migration to a declarative format for registering components provided by the extension (#5664)
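For point 2, the actual format is still to be designed in #5664, but to make the idea concrete, here is a purely illustrative sketch of a flat key/value descriptor and how the core could read it before loading any extension code (the format, keys, and class names are all invented):

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

public class DescriptorSketch {

    // Parses a flat key/value descriptor shipped at the root of an
    // extension archive (format and keys invented for illustration).
    static Properties parse(String descriptor) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(descriptor));
        } catch (IOException e) {
            throw new RuntimeException(e); // cannot happen with a StringReader
        }
        return props;
    }

    public static void main(String[] args) {
        String descriptor = String.join("\n",
            "id=my-extension",
            "minCoreVersion=3.8",
            "operations=com.example.MyOperation",
            "importers=com.example.MyImporter");

        Properties props = parse(descriptor);
        // The core can check compatibility and know which components to
        // register (or un-register) without executing any extension code.
        System.out.println("id = " + props.getProperty("id"));
        System.out.println("requires core >= " + props.getProperty("minCoreVersion"));
    }
}
```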

The more long-running improvements (for which I don't know what it should look like yet) are:

  • isolation of CSS / JS code provided by the extensions (so that a JS error in an extension does not abort the entire app? We can probably not have full guarantees, but maybe there are ways to avoid the most catastrophic failures)
  • improvements about the way extensions can rely on additional libraries and avoid conflicts between those

Those two points seem to be problems that no one claims to have fully solved, as far as I am aware, so we should not get wrapped up too much in them. As much as it makes sense to minimize migrations for extension developers, we should not block all improvements just because we haven't figured out the perfect solution from the start.

I wonder what @tfmorris and @abbe98 think about that? What do you think about the phasing of such changes, would you be happy with first introducing 1. and 2. in some stable version, and having extension maintainers migrate to that first?

The Eclipse Foundation has some projects that utilize extensions, so perhaps it would be wise to check in with them (leave no stone unturned).

But generally I think we should just embrace Butterfly and revamp it to fit our needs.

I'm all for this as I have ended up liking Butterfly on a conceptual level and I do not think it's a bad solution if given some love.

migration to a recent version of Jetty

Oh yes, I know of at least one issue on my end this would unblock.

migration to a declarative format for registering components provided by the extension

I'm all for it although I would suggest that a proposal for this declarative format is presented before one starts implementing it.

isolation of CSS / JS code provided by the extensions (so that a JS error in an extension does not abort the entire app? We can probably not have full guarantees, but maybe there are ways to avoid the most catastrophic failures)
improvements about the way extensions can rely on additional libraries and avoid conflicts between those

Those two points seem to be problems that no one claims to have fully solved, as far as I am aware, so we should not get wrapped up too much in them. As much as it makes sense to minimize migrations for extension developers, we should not block all improvements just because we haven't figured out the perfect solution from the start.

I'm not aware of any solutions either; however, just documenting the practice of "namespacing" CSS (class prefixes) and JavaScript (window objects) could go a long way towards improving today's situation. It's on my to-do list of things to work on in core, as I know it would resolve some issues there.

Can somebody guide me on how to contribute?

@anasadelopo great that you are interested in contributing! Please have a look here: Getting started | OpenRefine
and start a new thread if you need any help for your first contributions.

The thread here is about a rather technical subject which is not suitable for a first contribution.

Sorry I missed the ping on this thread. That approach sounds reasonable to me.

But generally I think we should just embrace Butterfly and revamp it to fit our needs.

I'm all for this as I have ended up liking Butterfly on a conceptual level and I do not think it's a bad solution if given some love.

Agree.

migration to a recent version of Jetty

Oh yes, I know of at least one issue on my end this would unblock.

I have branches with recent versions of Jetty. I'll check on their status.

migration to a declarative format for registering components provided by the extension

I'm all for it although I would suggest that a proposal for this declarative format is presented before one starts implementing it.

Agree with suggestion for design review before implementation (including extension developers), although I recognize that some prototyping using an existing extension may provide useful feedback.

I'm not aware of any solutions either; however, just documenting the practice of "namespacing" CSS (class prefixes) and JavaScript (window objects) could go a long way towards improving today's situation. It's on my to-do list of things to work on in core, as I know it would resolve some issues there.

I think conventions such as namespaces are a perfectly good solution. Not everything has to involve code.

Tom


So, apparently there has been a grant application made that posits that the extension API stability problem can be solved by a part-time junior contractor. I would like to respectfully disagree and argue that it will take careful API design, a commitment to maintain API stability, and the engineering discipline to follow through on the commitment. This is a task that needs to be agreed to and taken on by the core team.

Until there's a stable API that allows extension developers and users to have a predictable experience, spending time/money developing an extension store/marketplace/whatever makes no sense.

  • When upgrading OpenRefine, if some of the installed extensions are not compatible with the new version, OpenRefine will likely fail to start, without a clear error message.

That sounds like a straight out bug. What is the issue #?

In 2020, I did an analysis of (Java) API stability from 2.6 through 3.3. Since that thread is stranded in our abandoned list archive, I'll reproduce the closing here:

If we're going to continue to attempt to maintain a stable Java API there are things that we can do to help ourselves here including:

  • being more conservative about visibility of things so that developers can use the public/protected/private visibility to understand what they can rely on and what they can't
  • don't make internal third-party classes/interfaces part of the API. We got burned by this severely with the json.org objects, so we shouldn't repeat the mistake with Jackson.
  • audit the public APIs for additional trouble spots
  • document our intent for how long we'll support interfaces, what developers can expect, etc

In addition to the Java APIs we've got other extension points that we've encouraged developers to write to including those for:

  • importers along with their associated file types, MIME types, and format guessers

  • exporters

  • commands & operations

  • UI menu items

  • extension modules (Butterfly) bundling some of the above

There are also various miscellaneous internal structures like:

  • operation history format (JSON)
  • preferences
  • templating exporter templates

So, which, if any, of these interfaces do we want to publish as stable for developers to use? What guarantees do we want to make? How much engineering effort are we willing to invest to make this supportable?

Not mentioned above, but another important potential API that people have requested be supported as a stable public interface is the internal REST API (and, probably by extension, its operation history payload).

This isn't a new conversation. Here's another thread about extension points from 2010, before I was running the project (also in the old dev list archive).

There are a number of different technical approaches which can be taken to manage stable APIs, but the primary requirement is making the conscious decision and commitment. Are we prepared to do that?

Tom

p.s. I'm sure there are plenty of other discussions on the topic in the list archive if folks want to dig them out.

So, apparently there has been a grant application made that posits that the extension API stability problem can be solved by a part-time junior contractor. I would like to respectfully disagree and argue that it will take careful API design, a commitment to maintain API stability, and the engineering discipline to follow through on the commitment. This is a task that needs to be agreed to and taken on by the core team.

I was surprised by this too. It's weird to me to see a grant application such as this one without first going through the core contributors, I would personally even expect there to be consensus around a design proposal before an application like this.

I will take the time to go through the false statements in your last message.

First, I assume you are referring to the DEF application we submitted yesterday (details are available in this PR). There is no need to comment on it without correctly naming it or inviting the authors for comment. I invite you to read the application more carefully since you make false assumptions and shortcuts that risk misleading the conversation and preventing a constructive discussion.

  • At no point do we reference the work to be done by a junior contractor. I suppose you are deriving this from the tentative mini-budget; this is too much to extrapolate, as this funding may be combined with other sources.

  • You suggest that things are already decided when we clearly outline a step to "Validate the scope of the changes done to the integration and impact for the migration with current extension developers."

  • I strongly recommend avoiding the use of terms like "store" or "marketplace" as they imply a monetary reference, which is not the project's intent.

  • I appreciate your effort to deflect the conversation toward API design and denigrate other work at the same time. Again, we clearly state that there will be community outreach and design phases throughout this project.

Through the rest of your message, I can sense your resentment toward the migration to Discourse (a topic I will not reopen here). One good thing about Discourse is that one can edit one's messages. I invite you to test the feature and revisit your message to keep the conversation constructive and welcoming to others.

I really appreciated your efforts to moderate your tone on the forum and GitHub issues in the last three months. Sadly, it did not last longer.

I apologize for the surprise regarding this grant. We worked on a short timeline, less than a month, and based our work on existing documents available as it is something we have been discussing previously in

Overall, the grant application process is interesting but uncertain. It often takes months to know if your grant has been approved. It is risky to put too much design into a feature with the condition of securing the grant to realize the implementation.

This is a bit of a chicken-and-egg problem. The way I would like to approach it is to work with a roadmap with different levels of granularity. First, we list the initiatives we want to work on at a high level to see if we have consensus around them. This provides our partners, funders, and ecosystem with the general direction of the project. Those projects are great candidates for a grant application (or partnership). We can delve into design and technical requirements once we have secured the resources to work on the feature.

Today, I am aware that we struggle to reach a consensus on the project roadmap. It is part of the things I want to address this year, as outlined in this post Requesting Feedback: Documenting OpenRefine Community Handbooks - #9 by Martin

Finally, sorry to go off the initial topic of this conversation. Happy to split the thread into a new conversation.


I don't think there's any question that improving the extension ecosystem is important. It's been terrible since Day 1 and has only gotten worse. The question is how we go about doing it.

I would argue that the ordering should be:

  1. Decide what extension points we want to support & document the rest as deprecated
  2. Review/design the APIs that support those extension points with a careful eye towards long-term maintainability
  3. Put additional technical infrastructure in place to support this, such as:
    a. hiding internal implementation classes in private packages
    b. making an API compatibility review part of the release process, etc.
    c. bulletproofing extension loading
  4. Review 1-3 with existing and potential extension developers and revise as necessary
  5. Implement the above
  6. Update sample extension, tutorials, documentation
  7. Release the above together with any other breaking changes which are pending (Servlet API, etc.)

These are the technical underpinnings for thriving ecosystems like:

https://chromewebstore.google.com/

Tom


I would support a more open process around grant applications, so that more people have a chance to chime in. To me, the current process of the project manager and advisory committee steering that on their own is more an accidental habit than an intentional decision. We could talk more about that in a separate thread.

I think we all agree that this is not suitable work for a junior developer.

I completely support working on documenting extension points and their intended stability better. I think it doesn't need to be a ton of work, and we don't need to be overwhelmed by the stability commitments that this represents. We can and will still make changes to extension points after that; it just requires the appropriate versioning and communication around those changes.

At the same time, I don't think this prevents us from working on the extension install UX too. Think about the following exercise: let's assume we wanted to improve how users install extensions, without changing anything about the extension mechanisms in themselves. We can already do quite a lot! First, we could have a page that lists the extensions currently installed, as Tom's PR (#6421) demonstrates. Adding support for uploading an extension archive on that page, for OpenRefine to unzip it in the expected location and restart, doesn't feel out of reach either. Deleting an extension from such a page feels doable too. Of course, that would still be pretty brittle: incompatible extensions would still break OpenRefine easily. But designing such an extension UI from the start would help us drive our changes towards concrete UX improvements. From this effort, we realize that we cannot easily let people disable an extension without restart, for instance. That can help us drive our changes to the extension mechanism accordingly (for instance, using a declarative list of the components that an extension provides would help us un-register easily all components registered by an extension).
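To back up the claim that the upload-and-unzip step doesn't feel out of reach: a minimal sketch using only the JDK (this is not OpenRefine code, and the directory layout is illustrative) could look like this. The path check matters because the archive comes from the user ("zip slip").

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

public class ExtensionInstaller {

    // Extracts an uploaded extension archive into the extensions directory.
    public static void unzip(InputStream archive, Path extensionsDir) throws IOException {
        Path root = extensionsDir.toAbsolutePath().normalize();
        try (ZipInputStream zip = new ZipInputStream(archive)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                Path target = root.resolve(entry.getName()).normalize();
                // Reject entries that would escape the extensions directory.
                if (!target.startsWith(root)) {
                    throw new IOException("Blocked path traversal: " + entry.getName());
                }
                if (entry.isDirectory()) {
                    Files.createDirectories(target);
                } else {
                    Files.createDirectories(target.getParent());
                    Files.copy(zip, target);
                }
            }
        }
    }
}
```

The restart step, and surfacing incompatibilities gracefully, remain the hard parts; the file handling itself is mundane.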

Being able to release concrete UX improvements is also useful to maintain the trust of the broader community. When I talk to non-developers about improving the "extension mechanism", I often feel that they don't relate to what's in it for them with such work. It's pretty abstract! Even if they have broken their OpenRefine in the past with an incompatible extension, it's not necessarily easy to understand how the changes you mention above will improve anything about that. It's easy for people to think this project is just run by developers for developers, driven by an ideal of technical purity that doesn't match what users need. So I think it's worth finding the right balance there.

I don't have strong feelings about this, but I know that @abbe98 has expressed a preference for not bundling too many breaking changes at once (specifically, that the presence of the 4.0 branch is pushing us to postpone necessary breaking changes to be bundled with that architecture change). Taking inspiration from very mature ecosystems like the ones you mention above is a great idea, but I think it's fair to expect that we won't be able to just get it right in one try.

That's why I would put the focus on encouraging healthier development practices for extensions (with end-to-end testing and automated dependency updates) so as to ease the burden of keeping extensions up to date with an actively developed base tool. Those are things that can help the developers of existing extensions from day 1 of this effort, since we can already help them work on that without knowing what the future extension interface will look like.


As a follow-up to the discussion here I have created a separate thread, about our process for grant applications: Process for coordinating grant applications

Just a note that I have looked into the plugin system used by Mattermost, and it seems to tick all the boxes for the requirements I have in mind. Their backend is in Go and their frontend in React, so it's not something we can adopt directly, but it looks like a good example to follow if we ever want to migrate to a reactive UI framework. In short:

  • plugins can extend both the frontend and the backend
  • they can be installed via a simple UI by the end user
  • they are based on a manifest file called plugin.json which gives declarative metadata about the plugin, such as the minimum version of the base software. It doesn't list the individual backend and frontend components though. For instance the frontend components are registered dynamically from an initialize method (see Web app quick start)

Because Go is a compiled language, it's not so easy to extend the backend: they do so with the go-plugin framework. It runs plugins as separate processes which communicate with the main process via remote procedure calls (RPC).
Arguably, for Java there are a lot more options thanks to dynamic class loading.

For me, the lessons learned are:

  • for an extensible UI, just including Javascript code from plugins alongside the Javascript of the core app is workable. It's also compatible with reactive frameworks like React as long as you configure webpack (or similar) correctly. All we need to do is make sure the plugins only do a simple call to a registration system, such as window.registerPlugin('my_plugin_name', new HelloWorldPlugin()) which means that we can then control if and when the actual initialization code of the plugin runs.
  • look at that plugin developer documentation! It's pretty amazing! If only we could have something similar…
  • if there are established plugin systems in Go, well, maybe we can also look for some in Java and avoid rolling our own stuff?
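The first lesson above can be sketched in any language; here is a hypothetical Java equivalent of such a registration gate, where loaded extension code only registers a named entry point and the core decides if and when initialization actually runs (all names invented, not Mattermost's or OpenRefine's API):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class PluginRegistry {

    public interface Plugin {
        void initialize();
    }

    private final Map<String, Plugin> registered = new LinkedHashMap<>();
    private final Map<String, Boolean> started = new LinkedHashMap<>();

    // The only call an extension makes when its code is loaded.
    public void registerPlugin(String name, Plugin plugin) {
        registered.put(name, plugin);
        started.put(name, false);
    }

    // The core runs this later, e.g. only for enabled, compatible plugins.
    public void start(String name) {
        Plugin plugin = registered.get(name);
        if (plugin != null && !started.get(name)) {
            plugin.initialize();
            started.put(name, true);
        }
    }

    public boolean isStarted(String name) {
        return Boolean.TRUE.equals(started.get(name));
    }
}
```

Because registration is decoupled from initialization, a faulty plugin can be left unstarted (or its `initialize` wrapped in a try/catch) instead of taking the whole app down.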

To follow up on the last point: in the backend, our needs for dynamic extension loading, component registration, and dependency isolation are really standard. Arguably, in Java it's easier to implement your own little plugin system, like Butterfly did, because we can use dynamic class loading. But why should we? Do we have the capacity to build and maintain something like that? I'd rather have us focus on the tool itself.

Arguably using an existing plugin system does add another dependency, which is obviously part of the extension interface, and we want to minimize those. But to me it's a simple choice: either we break extensions by upgrading that extension framework, or we break extensions because we realize our own hand-rolled plugin system isn't quite up to the task and we need to tweak things ourselves.

Concretely, what are the options? So far I could only find two:

  • OSGi, which has a reputation for being quite heavyweight and complicated
  • pf4j, which wants to be a lightweight alternative to it

I have to say I am quite tempted by pf4j, as it seems pretty reasonable. For instance, it supports dependencies between plugins, which is something we could make use of (say, the OpenRefine Wikibase extension could be extended by other plugins which add support for custom datatypes or other extensible features in Wikibase itself).

Not a Java developer, but I can say that what I heard from expert developers (previous co-workers at Ericsson) was that OSGi can be lightweight, and its service model is one of the best: it makes things as simple as writing POJOs with a few annotations. OSGi heard the earlier complaints and has addressed them in years past, especially through Declarative Services and Config Admin. The Apache Felix community was a joy to work with, and I'd encourage anyone to start there and ask them deeper questions about our problems in OpenRefine.

So, it might depend on who you talk with, or what you read, regarding the current state of things with OSGi and past versus current best practices.

I think a good way to start simply with OSGi and feel things out would be to host an OSGi service for Preferences, where not only OpenRefine but also extensions could use that service for loading/saving/extending preferences. And you wouldn't have to write much on your own... Apache Felix Preferences Service :: Apache Felix


Looking at pf4j, I don't see any projects I recognize (or that have many GitHub stars) that use pf4j, except Appsmith and Dremio (code search results on github.com). But that link is interesting in showing how Dremio uses pf4j, similarly to how we could use it as well. It seems like it was only used in the Hive2 plugin and nowhere else, though? Still, very interesting.
