Branch structure to maintain 3.x and 4.x simultaneously

In the context of my work on the new architecture in the 4.0 branch, I would like to reach a consensus on a branch structure. I have been working on restructuring this branch as outlined in this post, and the next stages will be commits that diverge from the master branch, since they introduce breaking changes. This raises the question of which branch structure we want to adopt.

So far, our branching structure for our releases looks like this:

 | (master branch)
 |
 |
 + 3.7-beta1
 |\
 | \
 |  + 3.7-beta2
 |  |
 |  …
 |  |
 |  + 3.7.0
 |  |
 |  …
 |  |
 |  + 3.7.9
 |  |
 |  | (3.7 branch)
 |
 + 3.8-beta1
 |\
 | \
 |  | (3.8 branch)
 |
 | (master branch)

I would intuitively like to make it possible to keep maintaining the 3.x release series while publishing releases in a 4.x series, which would also be open for general contribution (such as #6460).

In my experience, most contributions to our current master branch can be merged into our current 4.0 branch without much trouble. So I would like to make it possible to keep contributing to 3.x, and I would take care of merging those changes into the new architecture.
This could be done either with a merge commit or by rebasing those further commits on top of the new architecture.

Concretely, I would propose something like this:

  • we delete the current 4.0 branch, which branches off from a point too far back in the past
  • the master branch remains the branch from which we make the 3.x series, and it remains the default branch of the repository
  • a new 4.x branch is created from master soon. From this branch, we make further branches for each minor release (4.0, 4.1 and so on), just like we do for the 3.x releases
  • I regularly merge the master branch into 4.x (rebasing would also be doable, but would probably be more work and would duplicate commits); a rough sketch of this routine is shown below
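
To make the porting routine concrete, here is a rough sketch of the commands I have in mind (branch names as proposed above; this is purely an illustration, not a prescription):

    # one-time setup: create the 4.x branch from the current tip of master
    git checkout master
    git pull
    git checkout -b 4.x
    git push -u origin 4.x

    # at regular intervals: port the 3.x work into the new architecture
    git checkout 4.x
    git merge master        # a merge commit keeps both histories intact
    # (resolve any conflicts against the new architecture, then)
    git push origin 4.x

    # minor release branches would hang off 4.x, just like for 3.x, e.g.
    git checkout -b 4.0 4.x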

Once the new architecture is sufficiently adopted, it could become the default branch of the repository, which would not preclude us from continuing to maintain 3.x (for instance with a specific 3.x branch).

Do you think that would be workable? Or would you expect a different workflow?

Are you sure that picture is right? We never merged code from 3.7.9 back into mainline?

I think that approach is the right path forward; it's something my own teams have done on projects in the past. At some point, you will definitely get tired of merging the master branch into 4.x, and from what I've seen, it happens pretty quickly :slight_smile: (no one likes keeping up the extra work for very long - but we'll see how long you last :wink: ).

The refactoring of the "4.0" branch that you outlined before, I basically see as creating a number of feature branches which can be reviewed in manageable chunks and merged into master. A better name for that branch might have been "futures" or "prototype" or something along those lines, because it's difficult to do early binding to version numbers and get it right. I'm not familiar enough with all the pending stuff to have an opinion on whether it's one mega release or a couple of medium size releases or something else and whether those release(s) should be 4.0 and 5.0 or 4.0, 4.1, ...

We know that there's a bunch of breaking stuff on your branch and we know that we'd like to improve the extension API/ecosystem, as well as update to more recent versions of Jetty. There are other items which affect serialization formats like getting classnames out of the JSON. There are probably other breaking or potentially breaking things on the horizon that I'm not thinking of. Laying all that out in a single list and figuring out how to chunk it into releases would be a worthwhile exercise.

Once the new architecture is sufficiently adopted, it could become the default branch of the repository

If that's another way of saying that you want to rename the 4.x branch to be called master at some point, that's not something that I'd support. That assumption was specifically what I objected to earlier in the discussion about breaking up the work into reviewable chunks (and I believe Albin did as well). Once that work reaches its logical conclusion, everything should have been merged into master and the 4.0, 5.0, etc releases can be made from it.

Tom

I must echo all of what Tom is saying above.

If that's another way of saying that you want to rename the 4.x branch to be called master at some point, that's not something that I'd support. That assumption was specifically what I objected to earlier in the discussion about breaking up the work into reviewable chunks (and I believe Albin did as well).

That is not what I mean. I would not want to keep the current 4.0 branch; I am talking about a different branch, 4.x, which would be introduced to make it possible to maintain the 3.x and 4.x release series together for a while.

everything should have been merged into master and the 4.0, 5.0, etc releases can be made from it.

My question is: imagine we merge into master today the first breaking change that I split out of my current 4.0 branch (for instance, changing from com.google.refine to org.openrefine). If we keep the same branching structure we have been using for past releases, this would mean the next release should be called 4.0 and that we would not publish a 3.9 release, which would not be great.

Given that the new architecture I am working on requires a different format to serialize projects, in such a way that projects from the previous format cannot be fully supported, I would be interested in supporting 3.x for a while longer. It is also worth doing to ease the pressure on extensions to follow up on those breaking changes.

The question I am trying to bring up in this thread is whether we could adopt a development workflow that would make this possible: can we maintain a 3.x and 4.x release series in parallel? What would the branch structure and porting process look like?

I am continuing to restructure my current work to make it more reviewable and to get as much of it as I can merged into master, but that's a slightly different topic.

Is that any clearer?

First off, I think it's problematic for both this and other discussions that we call the branch in question 4.0, not only in the context of this branch and its future but also given (much needed) breaking changes coming from others.

Secondly, I think that this discussion is problematic, as it is all centered around a monolithic branch which no other maintainer than @antonin_d knows and which no other maintainer can familiarize themselves with within any reasonable effort, given its size and scope.


If we had a set of breaking changes/branches for which we had consensus to merge, I don't think there would be much of an issue with deciding a point in time at which we would make a 3.x-series copy of master and start merging.

First off, I think it's problematic for both this and other discussions that we call the branch in question 4.0, not only in the context of this branch and its future but also given (much needed) breaking changes coming from others.

I agree the name is confusing, I should have chosen a different branch name. That's why I am proposing to drop it.

Secondly, I think that this discussion is problematic, as it is all centered around a monolithic branch which no other maintainer than @antonin_d knows and which no other maintainer can familiarize themselves with within any reasonable effort, given its size and scope.

I struggle to see how having a discussion about this can be problematic. Would you prefer that I do not bring this topic up at all?

Intuitively, it feels important to me to have this discussion, especially since you have concerns about my work. I want to hear them and discuss with you to see if and how they could be addressed.

I do think it would be possible for others to get familiar with the new architecture I have been working on. So I have tried to be transparent about this work, providing monthly reports describing the direction I am taking and giving higher-level overviews of the architectural changes in various forms. I have had very little feedback on this so far, so I wonder whether the format or the content is not well suited.

I am totally supportive of mapping out the breaking changes we have on our radar and deciding how to schedule them. Should we do that here? Do you have a particular process in mind?

I think this is a useful discussion and appreciate having it. I'm going to christen the prototyping branch "Delta" so that we have a non-numerical way to refer to it. Not to put words in his mouth, but I think Albin's concern is a perceived assumption that at some point in the future the goal is for Delta to be renamed "master". I share the concern and think it would be a bad idea.

My principle for the master branch is simple - it should never contain any discontinuities. It should represent a linear history from the time of SVN through the migration to Git and on into the future. It is the base from which we branch each major release.

Here's a thought exercise. Suspend disbelief for a minute and assume we've done our chunking exercise and that the following is true. We have four feature "chunks":

  • Unicorn - com.google.refine.* -> org.openrefine.*, package remodularization, and Servlet 6 with associated Butterfly changes
  • Dragon - new project serialization format and the set of features that depend on it
  • Griffen - versioned REST API with protocol cleanup and unified status/error reporting
  • Raptor - updated extension APIs

and let's say that we figured out a way to implement Griffen and Raptor such that existing extensions continue to work and can update on their own schedule. Unicorn and Dragon are breaking changes and become 4.0 by definition. If we release them separately, the first becomes 4.0 and the next becomes 5.0, but if we bundle them together, there's just 4.0. Griffen and/or Raptor can go in 3.9 or 4.0 or 5.0 -- separately or together. Again, this is based on the assumption that we figured out a way to implement them without breaking existing extensions.

The details aren't important for this exercise, just that there is some number of logical chunks, that some of those chunks depend on each other, and that some are independent. The more we can make them independent, the easier we'll make our release planning, but there will be some things which are naturally coupled together or ordered sequentially. Perhaps we decide on Servlet 4 instead of Servlet 6 and put it in Raptor instead of Unicorn. The important concepts are the chunking, the dependencies between the chunks, and late binding to releases. Doing builds from the Delta branch to let people play with the functionality is cool, but the actual 4.0 release should be done from master (and may look nothing like what Delta looks like).

Separately from release planning, I'd like to understand and test this statement more:

the new architecture I am working on requires a different format to serialize projects, in such a way that projects from the previous format cannot be fully supported [emphasis added]

If that means that there's no upgrade path for users, that seems problematic. It would represent the first time that we've had a release that didn't seamlessly upgrade users' workspaces. That is bound to significantly hinder adoption, so is a limitation that we should work really hard to remove if at all possible.

Tom

I would never suggest that you or someone else not bring up a particular topic. I find the framing around your branch problematic; it would have helped the discussion to use a single feature or some other subset as a more realistic example.

I have had very little feedback on this so far, so I wonder whether the format or the content is not well suited.

To me the showcases are very interesting and do help me understand the scope of your work. They are, however, not helpful for actually reviewing or even judging the work itself. We have been requesting feature branches for several months; as previously stated, I would be happy to give feedback, but let's do it on the same terms as all other contributions.

@abbe98 Well, then we are probably on the same page? As for breaking up my work into separate feature branches, this is something I have already been working on for a while, with the first parts recently merged into master.

@tfmorris about the project serialization change: what I currently have is a way to import projects in their existing format while discarding the project history, which is still quite disruptive.

There could be ways to improve on that, for instance by making adaptations to the current serialization format (in a non-breaking way) that would later make it easier for the new architecture to preserve the history, with some caveats (for instance, changing the position in the undo history would trigger recomputation of operations - which is still not great). My gut feeling is that this is a risky route to take, because it is likely to introduce corruption problems due to subtle differences between the implementations. I think it is safer to let users continue using OpenRefine 3.x to read their older projects reliably, rather than mislead them with unreliable backwards compatibility. Given that OpenRefine is still mostly centered around one-off data cleaning jobs, this does not feel absurd to me, but of course it is something we can discuss.

To both: would it be helpful if I make a document describing:

  • the current serialization format
  • the problems I have identified with it
  • the format I have been working on
  • the features it enables
  • the ways in which one could convert from the current to the proposed format

Those are things I have documented in various places already, but perhaps synthesizing them in a standalone document would help? Would this be something you'd be interested in giving feedback on? Not as a substitute for code review, but rather as a preliminary step, just to answer basic questions such as: do we want those features in the first place? Is the proposed architecture an end goal we agree on, or should we go for a different one? If we agree on it, are there meaningful ways to introduce it step-wise?

I understand you'd prefer to just review pull requests, and it's my job to figure out how to organize whatever changes I want to make into manageable chunks. The reality is: for those changes, I don't know how to do that. It doesn't mean it can't be done - I just haven't figured it out so far! Could you help me?

And perhaps the answer is: no, it's too big of a change, it can't be split meaningfully, it's too risky and it shouldn't be part of OpenRefine at all. Our users have lived with the scalability and reproducibility restrictions of OpenRefine for more than a decade, and they will continue to live with them. I have to say: that is an answer I would totally accept! I would have preferred to get it earlier so I could have spent my time on other things, and I would be sad not to be able to ship those features to users, but I can still accept it.

And again, I am keen to think about the bigger picture of many other big changes and how to orchestrate them. Sorry that I have phrased the topic in a way that was too centered on my own work.

Workspace migrations have always been one-way - you back up your directory, install OpenRefine vN+1, and it upgrades everything in place for you transparently. If you need to go back, you restore your backup and go back to using the old version. I don't think there's a strong need to get fancier than that, but I'm not sure what "import projects" entails. Does the user have to do them one at a time? That sounds like something that could be improved upon. If they can import their whole workspace, that sounds much easier.

I think I understand the difference between the two formats at a conceptual level:

  • current format - stores a machine-readable (class name + parameters) version of the operation, a textual version of the operation and its parameters, the newly updated data, and the old data that was replaced
  • proposed format - drops the new & old data and instead saves the operation to be applied to generate new from old, along with some metadata to make it easier to calculate inter-operation dependencies
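
In caricature, and purely as an illustration (field names invented here, not the actual on-disk format), I picture the difference roughly like this:

    Current-style history entry:

      {
        "operation": { "class": "SomeOperation", "parameters": { "column": "X" } },
        "description": "Some transform on column X",
        "oldCells": [ "a", "b", "c" ],
        "newCells": [ "A", "B", "C" ]
      }

    Proposed-style history entry:

      {
        "operation": { "class": "SomeOperation", "parameters": { "column": "X" } },
        "dependencies": { "columns": [ "X" ] }
      }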

Close?

An official, fleshed-out version of that, along with which features are tied to which aspects of the serialization format, could be interesting, but I wouldn't spend a ton of time on it, because it's just a proxy for the implementation, which is the important thing to review.

And perhaps the answer is: no, it's too big of a change, it can't be split meaningfully, it's too risky and it shouldn't be part of OpenRefine at all.

I certainly hope we don't get to that point. I hope that we continue to "rescue" functionality from the Delta branch and merge it into master. I'm willing to help with that effort.

Tom

That sounds like something that could be improved upon. If they can import their whole workspace, that sounds much easier.

Sure, there is a lot of room to streamline that.

Close?

Yes!

An official, fleshed-out version of that, along with which features are tied to which aspects of the serialization format, could be interesting, but I wouldn't spend a ton of time on it, because it's just a proxy for the implementation, which is the important thing to review.

Ok, I'll draft something small then.

I certainly hope we don't get to that point. I hope that we continue to "rescue" functionality from the Delta branch and merge it into master. I'm willing to help with that effort.

Great, thanks a lot! I think it's an exciting challenge too.

I'll start a separate thread about breaking change scheduling.