Helping users test PRs before they are merged

Compiling OpenRefine from source is not so easy: you need to have some familiarity with git, and install and configure Java, Maven and NPM on your machine.

I feel like the introduction of snapshot releases has had an interesting and positive impact on the project: because they are easier to install than compiling the tool from source, we have more people trying out new features before release. This is hard to quantify precisely of course, but I am hopeful that we will see a fairly straightforward 3.7 release process thanks to that.

Still, this does not help users test pull requests before they are merged, since packaged versions are only generated after merge. In the Wikimedia Commons integration project, I have the feeling this was a significant source of friction, and it also introduced an unhelpful bias: I was sort of encouraged to merge PRs quickly, so that they could be tested by checking out the snapshot releases. That’s a problem: testing before merging would be much better.

Ideally, anyone who opened an issue should be able to easily check whether a PR that addresses it actually solves their problem.

So I am wondering how to reduce this friction. I can think of the following options:

  • Make it easier to run OpenRefine from source. This is obviously a win for everyone. Surely we can improve the documentation about that, but it is not clear to me to what extent we can reduce the number of steps to take: we are already considering dropping the “feature” of downloading Maven on the fly in the refine/refine.bat scripts, so that is going pretty much in the opposite direction. Potentially we could have some helper scripts to check out a pull request (this is something Zulip does).
  • Use a similar GitHub Actions workflow to also publish built packages for pull requests. Those could then be advertised on the pull request, similarly to Netlify’s previews for the website and docs. The downside is that we would make PR builds heavier, and that this is generally quite storage- and bandwidth-intensive: we would be encouraging people to download hundreds of megabytes to review even a small change. But perhaps the lower entry barrier is worth it.
  • Come up with a way to spin up OpenRefine in some cloud provider for a given pull request. We do not officially support hosting OpenRefine, but for testing purposes I think that would be okay. We could have a link from each PR (added by a bot or as a PR check) to automatically spin up an instance of OpenRefine from that PR. This would require cloud resources but could potentially be less wasteful, in the sense that it would only be done when requested. This could be inspired by the OpenRefine deployment on PAWS, although I suspect that relies on having a more or less fixed Docker image.
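
To make the first option a bit more concrete, here is a rough sketch of what a checkout helper could look like. It relies on the fact that GitHub exposes the head of each pull request under `pull/<number>/head` on the repository remote. The function names, the `pr-<number>` branch naming and the default remote `origin` are my own assumptions for illustration, not an existing OpenRefine script.

```python
import subprocess


def pr_checkout_commands(pr_number: int, remote: str = "origin") -> list[list[str]]:
    """Return the git commands that fetch a pull request into a local branch."""
    if pr_number <= 0:
        raise ValueError("pull request number must be positive")
    branch = f"pr-{pr_number}"  # hypothetical local branch naming scheme
    return [
        # GitHub publishes each pull request head as pull/<number>/head
        ["git", "fetch", remote, f"pull/{pr_number}/head:{branch}"],
        ["git", "checkout", branch],
    ]


def checkout_pr(pr_number: int, remote: str = "origin") -> None:
    """Run the commands above in the current repository, stopping on the first failure."""
    for command in pr_checkout_commands(pr_number, remote):
        subprocess.run(command, check=True)
```

Calling `checkout_pr(<number>)` from inside a clone would leave the user on the PR's code, ready to run the usual refine/refine.bat script. It does not remove the need for git, Java, Maven and NPM, so it only addresses one of the steps.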

What do you think?

We should build on request using our CI, to keep the load down; users can then download the package and test it with their own data. I fear hosting would impose too many limits for thorough testing.

But I think PR builds being heavier is OK. Can we manage our free tier well enough on GitHub if we go that route?

I’m opposed to making PRs automatically publish packages, both to avoid heavier PRs and to avoid it being a potential vector for abuse.

I wonder, however, if we could have an action “listening” for a specific PR tag or comment that would trigger package publishing if made by a member of the GitHub org.
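
As a sketch of the check such an action would need to make: GitHub's `issue_comment` webhook payload carries the comment's `author_association`, and the `issue` object has a `pull_request` key when the comment was made on a PR. The trigger phrase `/build-package` and the function name are hypothetical, and which associations count as trusted is an assumption we would have to agree on.

```python
TRIGGER_PHRASE = "/build-package"  # hypothetical command, not an existing convention
TRUSTED_ASSOCIATIONS = {"OWNER", "MEMBER"}  # assumption: org owners and members only


def should_build_package(event: dict) -> bool:
    """Decide whether an issue_comment event should trigger a package build."""
    issue = event.get("issue", {})
    comment = event.get("comment", {})
    # Comments on plain issues lack the "pull_request" key
    if "pull_request" not in issue:
        return False
    # Ignore comments from people outside the trusted set
    if comment.get("author_association") not in TRUSTED_ASSOCIATIONS:
        return False
    # Only react when the comment starts with the trigger phrase
    return comment.get("body", "").strip().startswith(TRIGGER_PHRASE)
```

With something like this, a comment from an outside contributor or a comment on a regular issue would simply be ignored, and only an explicit request from an org member would start the heavier build.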

About abuse, note that PRs opened by new people do not trigger the CI automatically: a project member must approve them first.

Explicitly tagging the PRs for which such a system should be enabled feels a bit heavy to me: it’s another action that needs to be taken for each PR, and it feels redundant with this native GitHub security measure.

Good points. It would be interesting to see how much time it would add to current CI if enabled. I consider the current behavior rather annoying for minor PRs or drafts but if it doesn’t add much time…

Asking users to build OpenRefine from source will be a huge barrier for many non-technical users (a large part of our users don’t know how to use the command line).

Building from the PR branch is the most realistic short-term solution. The developer can trigger the build manually, only when the PR is ready for review (so we avoid an automated system building at each commit). That process would be the first step toward hosting PR versions of OpenRefine. Ultimately, the artifact should be deleted once the PR is merged.

Hosting is tempting; however, it introduces a lot of moving parts for us to manage (paying for computing and storage, maintaining the environment, managing user credentials). All recent successful hosted versions of OpenRefine are based on JupyterHub.