Developer and Community Engagement update: April 2025

Hi folks, I wanted to post an update on my work and what I'm thinking of working on next. Hopefully this gives the community an opportunity to check my work and comment on my plans for the next month. My main priority is to support the other OpenRefine contributors, so please don't hesitate to reach out if you see something unaddressed that you'd like to see improved.

April was largely focused on maintenance of the codebase. I was more active in issue and pull request triage, I released a new version of OpenRefine, and I continued researching OpenRefine's internals. With each issue and pull request opened, I learn more about how OpenRefine works, which means my reviews take less time and will hopefully be more helpful. On the technical side, I wanted to explore "Proxy all (or most) reconciliation API calls through the backend" (OpenRefine/OpenRefine#7185) as a means of learning more about OpenRefine's architecture. However, this led to me spending much more time learning about Butterfly (the web framework OpenRefine uses) than about reconciliation, as Butterfly seems more in need of knowledge sharing than other parts of the application architecture.

Looking ahead to May, I'd like my research into Butterfly to be useful, so I'm planning to write up some documentation on the framework. I'm also working towards the release of version 3.10, hopefully with a beta out towards the end of this week or early next week. Additionally, I'd like to spend time on a plan to stabilize OpenRefine's APIs, both internally for extensions and externally for REST API clients.

While general feedback is always welcome, I'd especially like to hear about the following:

  • What do you find confusing about Butterfly? What kind of materials would be most useful for you (Javadocs, tutorials, etc)?
  • What do you find most frustrating about extending OpenRefine, either through the REST API or through an extension? I know there is a request for more guides on building extensions and the REST API is generally described as "use at your own risk", but anything specific would be very helpful.
  • I held office hours every Thursday in April, though attendance was generally low. I would like to hold office hours again in May but would appreciate any feedback about when and how people would prefer to engage.

I hope this information is useful! Please feel free to comment on the format of this update itself. I'd like to be more transparent and accountable to the community and I hope updates such as these help with that effort.

  • What do you find confusing about Butterfly? What kind of materials would be most useful for you (Javadocs, tutorials, etc)?

My largest issue with Butterfly is that it's separate from OpenRefine: no one else uses it, yet it has a separate release process, and each time someone needs changes there to become unblocked in OpenRefine, it takes so much more time to get things rolling. It could just become part of the OpenRefine repository at this point, which would also allow it to use the same build system, tests, etc.

What do you find most frustrating about extending OpenRefine, either through the REST API or through an extension? I know there is a request for more guides on building extensions and the REST API is generally described as "use at your own risk", but anything specific would be very helpful.

The API is in an odd place given that we don't really do breaking changes but we also say that people shouldn't use it. I think a way forward would be to give it the same level of support as extension points and dependencies; that's the practice anyway, and it would likely spark some interesting uses in the wild.

Extensions, I think, are in a better place, with the main issue being a lack of documented practices, many of which would help avoid common issues (extensions impacting core features, layout, etc.). CSS and JavaScript namespacing comes to mind.

Both Butterfly and API improvements, along with their surrounding discussions, tend to quickly hit issues regarding breaking changes, which some core developers have been very vocally opposed to in the past. Maybe this is the time to revisit that question and rip the bandage off: improve the API, fix up some extension points, drop LESS support, and bump Jetty, Velocity, and other core dependencies. From my point of view that would be very welcome, as I think the improvements we could make greatly outweigh the burden of the breaking changes.


The following OpenAPI specification for OpenRefine is a few years old, incomplete, and created through request interception. I'm not sure if it's of much use to anyone but in case it is:


Thanks for the OpenAPI spec! I was looking into that so this is a huge help.

Maybe this is the time to revisit that question and rip the bandage off: improve the API, fix up some extension points, drop LESS support, and bump Jetty, Velocity, and other core dependencies. From my point of view that would be very welcome, as I think the improvements we could make greatly outweigh the burden of the breaking changes.

I think this makes sense. We can't avoid breaking changes forever, and it would be helpful to have a plan around the frequency and nature of breaking changes (like bumping the minimum required Java version or upgrading a key dependency) so as to minimize the disruption when they inevitably happen.
One thing I would like to examine is the issue surrounding the Butterfly classloader: "Butterfly classloader module isolation causes Jackson problems" (OpenRefine/simile-butterfly#15)
Decoupling OpenRefine's dependencies from those of its extensions seems like a worthwhile endeavor, as it would allow us to make some of these improvements without breaking extensions (or at least minimize the damage when they do break).


The biggest problem with the internal client/server protocol is not its design or lack of versioning, although those would both be important for a supported REST API, but its COMPLETE LACK of testing. The only way it is exercised is indirectly by the end-to-end tests which drive the web client. There is zero direct testing. There is zero testing via any client other than the web client. The JSON operation history payload is effectively part of this protocol and it isn't even tested by the end-to-end tests. Despite all these risks, I still see people recommending the use of these unsupported clients which depend on the internal protocol. Users who accept those recommendations are effectively declaring that they don't care about the integrity of their data, in my opinion.

This is all fixable, but it requires investment, which hasn't happened to date. This investment could come internally from the project or it could come from some third party which is interested in automation, but it needs to come from somewhere. There will also be an additional ongoing support cost, so decisions about what to support should be made with that in mind.

It would probably make sense to invest in at least one or two minimal client libraries (Javascript & Python?) to go with the officially supported REST API. For testing, we'd need to decide if we were going to only test at the client library API or also test at the HTTP level.
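
To make that concrete, here's a rough sketch of what direct testing at the HTTP level could look like, using pytest and requests. The command paths are taken from the current internal protocol as I understand it, so treat them as assumptions rather than a supported contract:

```python
# Sketch: direct HTTP-level tests against a local OpenRefine instance.
# Assumes OpenRefine is running at http://localhost:3333; the command
# paths below are the current internal (unsupported) ones and may change.
import requests

BASE = "http://localhost:3333"


def test_get_all_project_metadata_returns_projects():
    # Read-only command: should return a JSON object with a "projects" key.
    resp = requests.get(f"{BASE}/command/core/get-all-project-metadata")
    resp.raise_for_status()
    assert "projects" in resp.json()


def test_csrf_token_is_issued():
    # State-changing commands require a CSRF token obtained from this endpoint.
    resp = requests.get(f"{BASE}/command/core/get-csrf-token")
    resp.raise_for_status()
    assert resp.json()["token"]
```

Even a small suite along these lines, run against every command, would be a big improvement over zero direct testing.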

quickly hit issues regarding breaking changes, which some core developers have been very vocally opposed to in the past.

This sounds like it might be a reference to me (since there aren't that many core developers), but, for the record, I'm not opposed to breaking changes as long as we do them a) intentionally and b) in a thoughtful way. Of course, they should provide value as well -- hopefully in greater proportion than the pain they cause. Historically, a significant proportion of the OpenRefine releases have broken extensions, so it would be good if, as part of this revamping, we could engineer in enough stability to provide extension developers with some respite.

The following OpenAPI specification for OpenRefine is a few years old, incomplete, and created through request interception. I'm not sure if it's of much use to anyone but in case it is:

I went through that exercise a few years ago, but after looking at the trace, I came to the conclusion that massaging it into something useful might be more effort than starting from scratch.

Tom

Direct testing of the API would be fairly simple, but are there other approaches that would be useful, especially with operations and data in mind? A tricky thing, in my opinion, would be taking GREL versioning into account, but without it API versioning becomes kind of pointless.
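
For the operations-and-data angle, something like this round-trip might work as a starting point. All endpoint names, parameters, and the operation payload here are assumptions based on the current internal protocol, not a stable API:

```python
# Sketch: an operations round-trip test (import -> apply operation -> export).
# Endpoint names, parameters, and the operation JSON are assumptions based
# on the current internal protocol and recorded operation histories.
import io
import json

import requests

BASE = "http://localhost:3333"


def test_trim_transform_roundtrip():
    with requests.Session() as s:
        token = s.get(f"{BASE}/command/core/get-csrf-token").json()["token"]

        # Create a tiny project from an in-memory CSV.
        resp = s.post(
            f"{BASE}/command/core/create-project-from-upload",
            params={"csrf_token": token},
            files={"project-file": ("test.csv", io.BytesIO(b"name\n alice \n"))},
            data={"project-name": "roundtrip-test"},
        )
        project_id = resp.url.split("project=")[-1]  # id from the redirect URL

        # Apply a recorded operation history payload: trim whitespace in "name".
        operations = [{
            "op": "core/text-transform",
            "engineConfig": {"mode": "row-based", "facets": []},
            "columnName": "name",
            "expression": "grel:value.trim()",
            "onError": "keep-original",
            "repeat": False,
            "repeatCount": 0,
        }]
        s.post(
            f"{BASE}/command/core/apply-operations",
            params={"project": project_id, "csrf_token": token},
            data={"operations": json.dumps(operations)},
        )

        # Export the rows and check the transform took effect.
        export = s.post(
            f"{BASE}/command/core/export-rows",
            params={"project": project_id},
            data={"format": "csv"},
        )
        assert export.text.splitlines()[1] == "alice"
```

A test like this exercises the operation history payload directly, which the end-to-end tests currently don't.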

It's not, and to be fair it's more a sentiment around larger changes and investments. At one point I got frustrated enough that I offered to fund user research into automation and remote use, but even something indirect like that didn't go anywhere. Maybe some of that was because everything got overshadowed by the "4.0" work. Sadly, in our case it has meant that most of the investment has only happened locally rather than upstream.

Indeed, there is a reason this is old and incomplete.

I think that my position could be used to provide some initial investment into stabilizing the REST API. The "policy" part of the API versioning proposal could include guidance around when and how to evolve the REST API.
As for API clients, I'd be happy to work on a Python library that can be used to both interact with OpenRefine and as a medium for API testing.
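
As a very rough sketch of the shape such a library could take (method names are placeholders, and the endpoints are today's internal ones, so everything here is provisional):

```python
# Sketch of a minimal client, usable both interactively and as a test
# harness. Method names are placeholders; endpoints are the current
# internal (unsupported) command URLs and may change.
import requests


class OpenRefineClient:
    def __init__(self, base_url="http://localhost:3333"):
        self.base_url = base_url
        self.session = requests.Session()

    def _csrf_token(self):
        # Required for any state-changing command.
        resp = self.session.get(f"{self.base_url}/command/core/get-csrf-token")
        resp.raise_for_status()
        return resp.json()["token"]

    def list_projects(self):
        resp = self.session.get(
            f"{self.base_url}/command/core/get-all-project-metadata")
        resp.raise_for_status()
        return resp.json()["projects"]

    def delete_project(self, project_id):
        resp = self.session.post(
            f"{self.base_url}/command/core/delete-project",
            params={"csrf_token": self._csrf_token()},
            data={"project": project_id})
        resp.raise_for_status()
        return resp.json()
```

The same object could then back the HTTP-level tests, which would help keep the client and the protocol honest with each other.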

Words to the wise: don't try to work in isolation on a Python library. Get Python data wranglers and other members of the Python ecosystem involved.


I was looking at the existing Python libraries listed on the extensions page, but those are all explicitly archived or have been inactive for 10+ years. It might be worth building off of opencultureconsulting/openrefine-client, since that was the most recently updated, while keeping the option of starting a new one if that's preferable.

I think that my position could be used to provide some initial investment into stabilizing the REST API.

Yes, by "internal" I meant something funded by the Project "Advisory" Committee. This is a much bigger task than just "stabilizing" and I'm sure they'll want to know how much other stuff (features, bug fixes, etc) they will NOT get as a result of doing this before making a decision -- as well as what the additional ongoing support commitment will be. It's not a decision to be taken lightly. The first task would be to generate a draft task list and set of estimates to scope the effort.

As for API clients, I'd be happy to work on a Python library

Hopefully API clients can be generated from the OpenAPI description, in addition to documentation.

A tricky thing, in my opinion, would be taking GREL versioning into account, but without it API versioning becomes kind of pointless.

We don't currently have any versioning of GREL, as you know, and I'm not sure exactly how that would work. There are a number of tricky things: handling of extensions is another one, versioning of the operation history format, etc.

I found my copy of the OpenAPI trace and it has 28 endpoints documented (compared to 32 for the gist), but a quick scan of the server sources shows 83 commands, not including any of the bundled extensions, so neither trace is a very robust starting point. However, it looks like there's a brand-new, rewritten OpenAPI DevTools called demystify which might be worth checking out. Inserting its MITM proxy into the end-to-end tests might get at least 80% coverage (or whatever we have in that test suite).
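
As a fallback to demystify, plain mitmproxy could do a similar job with a small addon that records which command endpoints the suite actually hits. A sketch, assuming the /command/ path convention from the current sources:

```python
# coverage_addon.py -- a small mitmproxy addon to record which OpenRefine
# command endpoints the end-to-end tests exercise. Run with
# `mitmdump -s coverage_addon.py` and point the test suite at the proxy.
from mitmproxy import http


class CommandCoverage:
    def __init__(self):
        self.seen = set()

    def request(self, flow: http.HTTPFlow) -> None:
        # Record distinct (method, path) pairs for command endpoints.
        path = flow.request.path.split("?")[0]
        if "/command/" in path:
            self.seen.add((flow.request.method, path))

    def done(self):
        # Dump coverage when the proxy shuts down; diff against the 83
        # commands found in the server sources to see what's untested.
        for method, path in sorted(self.seen):
            print(method, path)


addons = [CommandCoverage()]
```

That list, diffed against the commands registered in the server sources, would give a concrete coverage number.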

Tom

Indeed, downstream we have ended up with multiple copies of GREL to manage this, but it's a rather terrible idea if one does not control all clients. On the other hand, if GREL isn't stable, a stable API is dramatically less useful.

I found my copy of the OpenAPI trace and it has 28 endpoints documented (compared to 32 for the gist), but a quick scan of the server sources shows 83 commands, not including any of the bundled extensions, so neither trace is a very robust starting point

I think there are a lot of individual operation commands that haven't been used for a while; the difference between our traces is probably because mine was done on a downstream version (before the compatibility breakage, but with a few extra endpoints).