OpenRefine 2024 Barcamp: OpenRefine as a Service

We started the session (see Barcamp page) by refining its scope and decided to discuss three topics separately: OpenRefine hosting, multi-user capabilities, and exposing the OpenRefine API.

Hosting

@tfmorris: What is the end-use case for hosting: training vs. production use? Single user vs. multiple users? Through the conversation, the following use cases were identified for hosting OpenRefine:

  • Helping trainers
  • Production usage for data transformation
  • Testing changes proposed via a PR
  • Multi-users

@antonin_d: Some use cases can already be done, some would require changes to OpenRefine. Related discussion on the forum: Hosted Uses of OpenRefine.

Training Environment

During training sessions, installing OpenRefine on the user's machine can take up significant time. Trainers also face different machine setups and restricted permissions.

@ostephens: Is there a way to get user desktops to use OpenRefine to open certain file types as a standard service? Opening files in OpenRefine differs quite a lot from how other software products handle it.

Production Environment

@jfaurel: Hosting for training and production usage are probably two different use cases.
@lozanaross: This is an institutional need because people have problems installing things on their work computers.
@mack-v: Data privacy concerns should be considered in general planning.
@b2m: More institutions are moving towards online workspaces, like online office and so on. How can we make OpenRefine a part of this? Moreover, hosted versions could better handle the RAM and CPU constraints of laptops.

Development Environment

@antonin_d: Mattermost ran a test instance with proposed changes as a sort of "preview" of the contribution on a pull request. This is another case of hosting. We automatically create snapshots after a PR is merged, but a live instance would reduce the friction for others reviewing a pull request, since they would not need to build OpenRefine from a branch.

Part of workflow

For some users, "OpenRefine as a service" isn't just about hosting. It is also about OpenRefine being integrated into workflows, even if it then runs on your own computer or as a service within your own organization.

Multi-Users

Alicia: It's a need. Collaboration on projects is a similar use case. The current solution is exchanging projects, but people would like functionality similar to Google Sheets.
@ostephens: Differentiate between simultaneously working on the same dataset vs. working on the same dataset at different times.

Multiple Users Working on Separate Projects

@ostephens: The concept of "checking out" datasets via GitHub would allow working on the same dataset at different times. This would make it easier to document changes made to the dataset.
@thadguidry: Some universities use processes like this. A workflow similar to GitHub, where users can simultaneously work locally on the same data and then commit it later, might be useful in some cases, but not all. Locking whole datasets is a more pressing need.

Multiple Users Working on the Same Project

Several forum threads already discuss multiple people working on a single dataset.

@thadguidry: We could partition a dataset inside OpenRefine so that users do not have to export manually. The workflow would be to (1) create a facet and (2) lock the faceted rows in the project. This way, two or more users could see the other rows/columns, and perhaps the live changes made to them, but could not edit the rows/columns locked by someone else. This means we would need to partition a dataset between users.
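The facet-then-lock workflow described above could be sketched as follows. This is only an illustration of the idea, not an OpenRefine feature: `RowLockTable` and `facet_rows` are hypothetical names, the facet is reduced to a simple equality test on one column, and the sketch assumes that unlocked rows stay editable by everyone.

```python
from dataclasses import dataclass, field


@dataclass
class RowLockTable:
    """Hypothetical structure: remembers which user holds the lock on each row index."""

    owner_by_row: dict = field(default_factory=dict)

    def lock_rows(self, user, row_indices):
        """Lock all given rows for `user`, or none of them.

        Returns the row indices already locked by someone else
        (an empty list means the lock succeeded).
        """
        conflicts = [r for r in row_indices if self.owner_by_row.get(r, user) != user]
        if conflicts:
            return conflicts
        for r in row_indices:
            self.owner_by_row[r] = user
        return []

    def can_edit(self, user, row_index):
        """Assumption: a row is editable when it is unlocked or locked by the same user."""
        return self.owner_by_row.get(row_index, user) == user


def facet_rows(rows, column, value):
    """Stand-in for a text facet: indices of rows whose `column` equals `value`."""
    return [i for i, row in enumerate(rows) if row.get(column) == value]


# Step (1): facet on a column. Step (2): lock the faceted rows.
rows = [{"country": "FR"}, {"country": "DE"}, {"country": "FR"}]
locks = RowLockTable()
locks.lock_rows("alice", facet_rows(rows, "country", "FR"))
```

With this partition, every user can still read all three rows, but only "alice" can edit the rows where the country is FR.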

Authentication - User Login

We need to introduce the concept of a user throughout OpenRefine. What would a user login process look like?
@antonin_d: Authentication should be pluggable because different users want to use different methods. We need to make sure that we don't degrade the experience of local OpenRefine users to support the remote use cases. Do we have to choose one option at one point?
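One way to picture "pluggable" authentication is a small interface with interchangeable implementations, where the default does nothing so that local users are unaffected. The sketch below is an assumption made for illustration: `Authenticator`, `LocalAuthenticator`, `StaticTokenAuthenticator`, and `build_authenticator` are hypothetical names and do not correspond to existing OpenRefine classes.

```python
from typing import Optional, Protocol


class Authenticator(Protocol):
    """Hypothetical plug-in interface: map credentials to a user name, or None."""

    def authenticate(self, credentials: dict) -> Optional[str]:
        ...


class LocalAuthenticator:
    """Default for desktop use: no login step, everyone is the same local user."""

    def authenticate(self, credentials: dict) -> Optional[str]:
        return "local"


class StaticTokenAuthenticator:
    """Toy remote method: look the user up from a shared-secret token."""

    def __init__(self, users_by_token: dict):
        self._users_by_token = users_by_token

    def authenticate(self, credentials: dict) -> Optional[str]:
        return self._users_by_token.get(credentials.get("token"))


def build_authenticator(method: str = "local", **options) -> Authenticator:
    """Pick an implementation by name; a deployment could register more methods here."""
    factories = {"local": LocalAuthenticator, "token": StaticTokenAuthenticator}
    return factories[method](**options)
```

Because the default method is "local", an installation that never configures authentication keeps its current behaviour, which matches the concern about not degrading the local experience.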

VIB-Bits built something to check projects in and out. The user first had to log in somehow and could then check existing projects in or out. Checking out brought the file to the local machine, where it worked as a normal OpenRefine project. The user had to remember to check the file back in manually. All users could see the status of each project and who had it checked out, so they could contact that person if they needed a project that was checked out.
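The check-in/check-out process described above amounts to a whole-project lock with a status list visible to everyone. The following is a minimal sketch under that reading; it is not the VIB-Bits implementation, and `CheckoutRegistry` and its methods are hypothetical names.

```python
from typing import Dict, Optional


class CheckoutRegistry:
    """Hypothetical registry: at most one user holds each project at a time."""

    def __init__(self) -> None:
        # Project name -> user who has it checked out (None when available).
        self._holder_by_project: Dict[str, Optional[str]] = {}

    def register(self, project: str) -> None:
        """Make a project known to the registry without changing an existing holder."""
        self._holder_by_project.setdefault(project, None)

    def check_out(self, project: str, user: str) -> bool:
        """Give the project to `user` if nobody holds it; raises KeyError if unknown."""
        if self._holder_by_project[project] is not None:
            return False
        self._holder_by_project[project] = user
        return True

    def check_in(self, project: str, user: str) -> bool:
        """Release the project; only the current holder may check it back in."""
        if self._holder_by_project[project] != user:
            return False
        self._holder_by_project[project] = None
        return True

    def status(self) -> Dict[str, Optional[str]]:
        """Status list every user can read to see who holds which project."""
        return dict(self._holder_by_project)
```

The `status` method corresponds to the shared overview mentioned in the notes: a user who needs a checked-out project can see who to contact.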

Implementation

@jfaurel: Tried to host OpenRefine on a remote server, but there were issues with where the data came from when creating projects and exporting data.
@lozanaross: Running OpenRefine in Docker with single sign-on is possible; it would be nice to use this with multiple users.

Hosting on different infrastructure should not be limited to Azure, Google or Amazon. A lot of institutions run their own data centers, such as the TU Dresden cluster or TIB.

We need to define all the environment variables that need to be configurable.
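As a starting point for that list, a hosted deployment might read its settings from the environment along these lines. The variable names (`EXAMPLE_REFINE_PORT` and so on), the `HostedConfig` fields, and the default values are all placeholders invented for this sketch; they are not existing OpenRefine settings.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class HostedConfig:
    """Hypothetical settings a hosted instance might need to expose."""

    port: int
    data_dir: str
    max_memory_mb: int
    auth_method: str


def load_config(env=None) -> HostedConfig:
    """Build the configuration from environment variables, with placeholder defaults."""
    env = os.environ if env is None else env
    return HostedConfig(
        port=int(env.get("EXAMPLE_REFINE_PORT", "3333")),
        data_dir=env.get("EXAMPLE_REFINE_DATA_DIR", "/data"),
        max_memory_mb=int(env.get("EXAMPLE_REFINE_MAX_MEMORY_MB", "2048")),
        auth_method=env.get("EXAMPLE_REFINE_AUTH_METHOD", "local"),
    )
```

Passing a plain dictionary instead of the real environment, for example `load_config({"EXAMPLE_REFINE_PORT": "8080"})`, makes it easy to check a deployment's settings without starting anything.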

Public OpenRefine API

@tfmorris: A public API could replace a number of self-written workflows people are currently using, and there is demand for this.
@antonin_d: The OpenRefine API feels like a separate topic.