GitHub workflow for chunked uploads to Wikimedia Commons

As you may have seen on GitHub, I've done a few things in relation to #4303. There were some false starts, partly because I'm not that used to the GitHub workflow. We're also using a somewhat particular structure for the work on this issue.

The idea is to have a branch for the feature (issue-4303-chunked-upload) into which commits will be merged as parts of the feature are done (chunked-uploads is the first of these). Once all required commits are in, the feature branch will be merged.

The reasoning behind this is that there will likely be stages in the development where you shouldn't use the new functionality even if it's technically possible. E.g. you can upload in chunks, but SDC isn't added, which means the upload to Commons may be faulty.

If you're interested in the planned breakdown of the feature, there's a list in #6505. That PR is no longer active and will be replaced, but the list should remain (unless something comes up during development 🙂).


The idea is to have a branch for the feature (issue-4303-chunked-upload) into which commits will be merged as parts of the feature are done (chunked-uploads is the first of these). Once all required commits are in, the feature branch will be merged.

The reasoning behind this is that there will likely be stages in the development where you shouldn't use the new functionality even if it's technically possible. E.g. you can upload in chunks, but SDC isn't added, which means the upload to Commons may be faulty.

I'm not sure I understand what this extra level of indirection buys you. The issue-4303-chunked-upload branch seems to live in the main OpenRefine repository, which seems unusual. If it lived in your fork, you could commit to it whenever you wanted and, when the entire feature is ready, propose the branch as a pull request. This would seem to address the atomicity concern that you have (people using a partially complete feature). It would also make sure that the documentation gets updated along with the code, even though they're separate items on the checklist.

Tom

p.s. What does "SDC" mean in this context?

Having a feature branch in the main repo is meant to make the development more visible. @antonin_d, was there any other reason we decided to do it this way? This is the first commit, so if we run into any unforeseen issues we may have to change things a bit.

SDC is short for Structured Data on Commons. In this instance the important thing is that it's added in a separate request, after an image has been uploaded. Since it may contain important info, like the license, we don't want people using the feature before it's included.

I think the original workflow you proposed was to have the feature branch in your fork and make PRs within your fork to that feature request, so as to get review feedback gradually and not just when merging the feature branch in the official repo. My reluctance here is that I think the reviewing should happen in the main repository, not on your fork, so that it is visible and others have a chance to chime in. I think that's not possible if the feature branch is in your repository.

It's a fine line to tread, but what I am trying to do here is to meet two conflicting needs:

  • I want to be flexible and be open to adopting workflows that suit you, so that you have a good time working on OpenRefine, can get productive and want to do more of it in the future
  • I want to encourage you to integrate meaningfully in the OpenRefine project, which means adopting its existing practices and communication channels, so that this collaboration isn't too dependent on me being a single point of contact between WMSE and the OpenRefine devs.

That being said, I don't know how realistic it is that others will chime in on reviewing this particular PR, so maybe I'm leaning too far toward the second need and should just agree to review the individual steps in your fork. We could then switch to the "standard" way of doing things in OpenRefine for further changes, which should hopefully be easier to do atomically.

Oh, I didn't realize that Antonin was involved in deciding this. My
interest/expertise is in data wrangling rather than picture uploading,
so I'm not going to be able to contribute to this in a meaningful way,
but regardless of where the branch is hosted, the key review gate is
when things are ready to be merged into the master branch.

I'm fine with continuing this project using this methodology to see
how it works out, but an alternative to consider for similar future
efforts would be to use GitHub's Draft PR mechanism and explicitly
request feedback, even though the PR is marked as "Draft." I guess part
of what determines the appropriate workflow is the scale of the work.
Something which takes a few days or a week or two is going to have
different requirements than an effort which is months long. In
general, we want to keep the reviewable chunks (i.e. branches)
tractable for the reviewers.

Something we haven't used in the past, but we might want to consider
for large multi-part features is the use of "feature flags" to
enable/disable them at run time. This would allow incomplete features
to be disabled/hidden.
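
To make the feature-flag idea concrete, here is a minimal, language-agnostic sketch in Python (OpenRefine itself is written in Java, so this only illustrates the pattern). The flag name `REFINE_ENABLE_CHUNKED_UPLOAD` and both function names are hypothetical and do not correspond to any existing OpenRefine setting or API. The assumption is that the flag is read from the environment and defaults to off, so an incomplete feature stays hidden unless someone opts in.

```python
import os

# Hypothetical flag name, used for illustration only;
# this is not an existing OpenRefine setting.
CHUNKED_UPLOAD_FLAG = "REFINE_ENABLE_CHUNKED_UPLOAD"

# Values that count as "switched on"; anything else means off.
TRUTHY = {"1", "true", "yes", "on"}


def is_enabled(flag_name, env=None):
    """Return True only if the flag is explicitly switched on."""
    env = os.environ if env is None else env
    return env.get(flag_name, "").strip().lower() in TRUTHY


def choose_upload_strategy(env=None):
    """Pick the upload code path at run time.

    The incomplete feature (chunked upload without SDC) is only
    reachable when the flag is set; the default path is unchanged.
    """
    if is_enabled(CHUNKED_UPLOAD_FLAG, env):
        return "chunked"
    return "single-request"
```

With a flag like this, partial work could be merged into master in small, reviewable pieces, since users would not reach the new code path until the flag is turned on or removed once the feature is complete.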

Ah yes, if the context wasn't clear, there is this introduction thread:

I am involved in the sense that I am liaising with Wikimedia Sweden (with weekly calls) to answer any questions they have about contributing to OpenRefine 🙂

An update for those keeping an eye on this post: chunked uploads are now included in the snapshot release (since Snapshot #2370). Please give it a spin and give us a shout if there are any unforeseen issues.