Hosted uses of OpenRefine

OpenRefine is designed to run on a local machine and our documentation about running it as a remote server features some prominent warnings:

At the moment I see:

  • quite some interest from users in hosted versions: for instance, the documentation about OpenRefine on Wikidata recommends running OpenRefine on PAWS, a cloud hosting service of the Wikimedia movement
  • some tensions in the dev team about how to approach this topic: whether we could officially support this use case, whether the impact of a particular change in a hosted use case should be taken into account, whether to accept a security advisory as a vulnerability, and so on.

From my perspective, we are struggling to find the right balance between meeting users' interest by advertising OpenRefine as a tool that can be hosted, and keeping users safe by warning them about the security implications of running a tool in an environment it is not designed for.

I think it is worth discussing this topic as an attempt to improve the situation.
Our documentation on this topic is not really satisfactory: we do not do a good job of explaining what sort of problems come with running a hosted OpenRefine. Also, I think it is worth not lumping all use cases together. I see two aspects of the problem:

  • running OpenRefine's server on a different machine than the one running the browser used to access it
  • having different users access the same OpenRefine instance, potentially concurrently

In my opinion, both of those come with their own issues, but those are fairly distinct. Do you see other useful distinctions to make between various "hosted" use cases?

I would like to document the known issues about using OpenRefine in those different contexts and improve the documentation accordingly. Once that is done, it would provide a good basis to classify security vulnerabilities, because those that fall within the scope of those known issues can be rejected accordingly.

Then, I would like us to reach an agreement on what sorts of use cases we are interested in eventually supporting, meaning that we welcome improvements that address usability or security issues in those contexts.

If you agree with the approach, let's start mapping the different sorts of hosted use cases and the issues we are aware of for them.

Thank you Antonin for bringing up this topic!

First off, I think there are a lot of misconceptions regarding hosted use cases. Given how little information is out there on the matter, that's not surprising, and I'm as guilty as anyone given the lack of a write-up about how we host and use OpenRefine (it has been on my to-do list for months now).

The issue that seems to cause the most tension is, in my experience, whether the API should exist only to serve the bundled GUI or not.

Another issue I find problematic and struggle with is settings/configuration. This is especially problematic for us because we can have up to a handful of users using the same instance simultaneously. For a setup like the one powering PAWS, this would be less of an issue, which illustrates one of Antonin's points above.

These two points are just illustrative examples: the first one because it keeps coming up, and the second one because I can see it triggering more tension given how the current solution is modeled.

It's my personal opinion that addressing the core needs of hosted use cases would benefit all OpenRefine users: hardening the client against connection failures, a stable and documented API, etc. There is of course other work and patches that would have little to no benefit (systemd support, webhooks, various extension points*, etc.), but it's unlikely these would do any harm**.

I'm not aware of any tension regarding security or safety issues when it comes to hosted use cases. All hosters I know of run OpenRefine in isolated containers and bring their own access methods.***


I'm not sure how we move this discussion forward but if we believe that it would benefit from some proper stakeholder research I would be happy to either take the lead or provide the necessary funding.


*We don't actually maintain custom extension points downstream given it's downstream and only for us.

**Regarding maintainability, someone maintains these features anyway, and without having to maintain custom patches that someone will have more time for other improvements (personally I would rather work on vanity efforts like dark mode than on updating patches).

***All security-related patches we apply, except for one, are actually more critical on local installs (assuming you don't run OpenRefine sandboxed). Containerization alone makes a hosted OpenRefine instance likely more secure than the default distributions.

Thanks for following up on this!

Interesting, I would not have put this issue in the same category. For me, running hosted instances and using the API from another tool are fairly distinct issues. Can you elaborate more on how you rely on this?

Yes, multiple users using the same instance at the same time must be quite problematic. Even though OpenRefine isn't currently designed for that, I think it would be useful to document the problems you encounter. Do your users work on the same projects simultaneously, even?

As far as I am concerned, I am supportive of improvements that go towards easing those use cases. A multi-user, hostable OpenRefine will not happen overnight, but if we refuse any step in that direction because it does not solve the whole problem at once, then we are sure we will never get there.

For me, the main requirement is that those improvements do not decrease the usability in the current local, single-user use case. For instance, say we introduce a notion of user account in the tool and require people to be logged in to open a project. I would not want local users to be asked to create an account on their single-user OpenRefine instance when they install it, because it would be an unnecessary extra step in the start up process.

But there can be much easier changes which improve usability in your use cases and do not degrade the experience of local users. For instance it might make sense to move some settings away from our server-side PreferenceStore into a browser-side local storage - a change which would be mostly unnoticeable for our traditional user base.

You can only do so much with such an approach of course, but I would be open to more ambitious changes, for instance making the PreferenceStore pluggable, making it possible to implement different versions of it (the existing one, and some multi-user one which would rely on some authentication mechanism exposed by the proxy behind which OpenRefine runs). Our preferences system is really poor at the moment anyway, so it's easy to imagine we could have something better.
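To make the pluggable idea a bit more concrete, here is a minimal sketch in Python. It is only an illustration of the shape such an interface could take, not OpenRefine's actual Java code: the class names, the method signatures and the `X-Forwarded-User` header are all assumptions made up for this example.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, Mapping, Optional


class PreferenceStore(ABC):
    """Hypothetical interface; the real PreferenceStore is a Java class."""

    @abstractmethod
    def get(self, key: str, user: Optional[str] = None) -> Any:
        ...

    @abstractmethod
    def put(self, key: str, value: Any, user: Optional[str] = None) -> None:
        ...


class SingleUserStore(PreferenceStore):
    """One shared set of preferences; the user argument is ignored."""

    def __init__(self) -> None:
        self._prefs: Dict[str, Any] = {}

    def get(self, key: str, user: Optional[str] = None) -> Any:
        return self._prefs.get(key)

    def put(self, key: str, value: Any, user: Optional[str] = None) -> None:
        self._prefs[key] = value


class PerUserStore(PreferenceStore):
    """Separate preferences per user, falling back to shared defaults."""

    def __init__(self, defaults: Optional[Mapping[str, Any]] = None) -> None:
        self._defaults: Dict[str, Any] = dict(defaults or {})
        self._by_user: Dict[str, Dict[str, Any]] = {}

    def get(self, key: str, user: Optional[str] = None) -> Any:
        if user is not None and key in self._by_user.get(user, {}):
            return self._by_user[user][key]
        return self._defaults.get(key)

    def put(self, key: str, value: Any, user: Optional[str] = None) -> None:
        if user is None:
            # No authenticated user: treat the value as a shared default.
            self._defaults[key] = value
        else:
            self._by_user.setdefault(user, {})[key] = value


def user_from_headers(headers: Mapping[str, str]) -> Optional[str]:
    # Assumed header name; whatever the authenticating proxy actually sets.
    return headers.get("X-Forwarded-User")
```

With something like this, a local install would keep using the single-user store unchanged, while a hosted deployment could swap in the per-user one and pass along the identity provided by its proxy.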

I would find that really exciting!

Did you consider submitting those patches upstream? The new process to disclose advisories on GitHub is pretty nice and would let us review the issues privately.

I think it's in more than one category, but I see it as the big "enabler" for hosted use cases, especially in the ETL space. For us it's all about using the API to integrate OpenRefine with other software, like our data pipelines or visualizations.

We have surprisingly few issues, I would say. When someone works on a project it's marked as "occupied" to prevent multiple people from working on the same project at once; however, the main reason isn't technical but rather that it wouldn't improve anyone's productivity.
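For illustration only, an "occupied" marker of this kind could be sketched roughly as follows in Python. This is not the actual downstream code: the `ProjectOccupancy` class, its method names and the use of plain user-name strings are assumptions made for the example.

```python
import threading
from typing import Dict, Optional


class ProjectOccupancy:
    """Hypothetical in-memory registry of who currently occupies a project."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._holders: Dict[int, str] = {}

    def acquire(self, project_id: int, user: str) -> bool:
        """Mark the project as occupied by user; False if someone else holds it."""
        with self._lock:
            holder = self._holders.get(project_id)
            if holder is None or holder == user:
                self._holders[project_id] = user
                return True
            return False

    def release(self, project_id: int, user: str) -> bool:
        """Free the project, but only if user is the current holder."""
        with self._lock:
            if self._holders.get(project_id) == user:
                del self._holders[project_id]
                return True
            return False

    def holder(self, project_id: int) -> Optional[str]:
        """Return the user occupying the project, or None if it is free."""
        with self._lock:
            return self._holders.get(project_id)
```

A front end could call `holder()` to show who is working on a project, and refuse to open it for anyone else while `acquire()` returns False.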

For instance, say we introduce a notion of user account in the tool and require people to be logged in to open a project. I would not want local users to be asked to create an account on their single-user OpenRefine instance when they install it, because it would be an unnecessary extra step in the start up process.

I completely agree and might even argue that accounts or access control** shouldn't be in OpenRefine, but rather that OpenRefine should provide a good set of extension points that allow one to annotate project and edit history metadata. In general, for this type of feature I think focusing on extensibility rather than the feature itself is the best solution for all. A focus on and commitment to the API, extension points, and possibly webhooks would go a very long way.

This is actually what we do, and then we "lock down" server-side settings for the things we don't want users to access. I could see some issues with trying to align this approach, as some settings which local users are used to aren't available to them on our end (reconciliation services, for example), but I could imagine that "settings levels" could be a matter of configuration if it's ever enough of a use case.

You can only do so much with such an approach of course, but I would be open to more ambitious changes, for instance making the PreferenceStore pluggable, making it possible to implement different versions of it (the existing one, and some multi-user one which would rely on some authentication mechanism exposed by the proxy behind which OpenRefine runs). Our preferences system is really poor at the moment anyway, so it's easy to imagine we could have something better.

Oh I'm getting ideas as I read, sounds like a case for that BarCamp!

Did you consider submitting those patches upstream? The new process to disclose advisories on GitHub is pretty nice and would let us review the issues privately.

Yes, but it's unlikely I will contribute in this area given how quickly it has gotten toxic and dismissive in the "past". Maybe in the future.

**Some native access control mechanism for the API could be useful, but it's a different question I believe.

I am totally supportive of people investing engineering effort in high performance, secure, versioned protocols & APIs, pluggable authentication mechanisms with SSO support, project sharing & access control lists, monitoring infrastructure, and whatever else is required by their particular definition of "hosted," as well as setting up testing for whatever configurations are supported.

What I'm not in favor of is backing into support commitments without explicit discussions, "badge engineering" support by just changing the label on the box rather than actually doing the necessary engineering work, or misleading customers about the support status of features. I'd LOVE to have multi-client support with rolling upgrades or an easily editable transformation language with good error reporting or any number of cool features, but just wishing doesn't turn a basic JSON undo history into a robust DSL or an internal protocol into a well designed API. Apparently, the current "API" documentation was arbitrarily imported from the refine-python project in 2016 turning it from completely unsupported and undocumented into some kind of quasi-documented and unsupported, but not clearly labelled as such, status. That's a terrible way to do things. Properly engineered protocols go through the more extensive design reviews that you'd expect of a public API and they have versioning, test suites, documentation, and all the other things that you'd expect.

Because "hosted" means so many different things to different people, I'd suggest not using the term and instead define specific configurations or attributes to be supported. A private ephemeral containerized OpenRefine server running on a high performance cloud instance is a very different beast from a multi-user, multi-tenant, shared service. Starting with a list of features / configurations would allow the effort for each to be sized.

Supporting a full development ecosystem is a significant additional effort on top of supporting an end user application and, honestly, I struggle to see how it's feasible considering how severely understaffed the project is, but hopefully it'll be self-supporting somehow. I'd certainly start by talking to commercial companies like RefinePro, Ontotext, etc who might have code and/or expertise. It'd be nice to see them start giving back instead of just taking.

Tom

I don’t think simultaneous users editing in the same project is approachable or warranted. Google toyed with this idea, prototyped, and gave up on the usefulness. Basic chat communication between users to organize and prioritize was found much easier and just as effective. But the intro of indicators in places might help team workflows in journalism or GLAM; dunno.

Regarding hosting: I think helping the community to continue to support containerization (docs, JDK testing, CLI enhancements, etc.) is expected from us, and the recent repo(s) idea is a good start. Later it might have Docker Hub publishing automation if the community desires.

Thanks @antonin_d for opening this conversation and @abbe98 for sharing about your current integration.

As a next step, I suggest that we create user stories in GitHub to identify various use cases, gauge interest from the community, and then determine the required technical implementation. After analyzing the user stories, we can make informed decisions on what we would like to invest in and support.

As members of the mentioned organisations are active members of the OpenRefine community, this final sentence feels unfair and counterproductive. As a community I believe we can have honest discussions on this forum without being disrespectful, and I hope we can do that in this case.