Scheduling breaking changes we have on our radar

As a follow-up to the discussion in this thread, let's talk about how to schedule the various major changes we have on our radar.

I see this discussion not as a substitute for establishing a proper roadmap, but rather concentrated on changes which are "breaking" in the sense that they require adaptations for extensions. For instance, new features such as adding support for Wikidata lexemes wouldn't need to be scheduled here, because if we want to implement them it should be possible to do that at any stage, without a breaking change. That being said, some new features can be made easier (or harder) by other changes, so if we are aware of such a dependency, it's worth mentioning.

Also, the extent to which a change is breaking can be debated: there is often a choice of how much effort we want to put into making it compatible (such as by duplicating interfaces). I see it as a trade-off between our own development capacity, that of the developers of extensions and reconciliation services, and the ability of end users to adapt. I'd say there is no absolute right or wrong here and I hope we can debate those things in a calm and supportive environment.

Would it make sense to first just list the changes we have on our minds?

The ones I am aware of off the top of my head are:

  • major upgrades to libraries which are part of the extension interface (Jetty, Jackson, maybe others?) - such as PR #6077
  • removing Jackson from the extension interface by introducing other mechanisms for JSON serialization, as advocated for by @tfmorris (given Jackson's stability it's not clear to me that it's worth the breakage)
  • replacing the registration of components contributed by extensions from controller.js to a declarative format (perhaps it can be done without breaking, if supporting both extension formats for a while is doable?)
  • changes to dependency isolation in Butterfly (classloader behavior)
  • change of namespace from com.google.refine to org.openrefine
  • immutable project data storage with lazily computed operations between history steps
  • changes to the way frontend assets are bundled together: for instance, changes in how extensions are expected to initialize their contribution to the frontend (and un-initialize them too, perhaps?)
  • changes to operation registration, linking the backend and frontend components together (can probably be done in a backwards-compatible way without too much effort though)

Anything else? Anything you disagree with (such as things we shouldn't do at all?)

@antonin_d Is getting Python 3 / Jython something on the radar? Eventually, correct? But perhaps that is 2 years away because of team capacity and this thread is about things within 1-2 years, yes/no? I guess we should rename this thread to a timeline instead of generic "radar"? We have lots on our radar/roadmap, and so quantifying that more specifically would likely help extension developers.

That's a good question! If we want to go for GraalPython, then we could move to requiring GraalVM as a JRE - that's arguably an important change that we could want to schedule here too.
If instead we go for py4j, perhaps that would have fewer implications in terms of breaking changes but more packaging work.

Thanks for adding it to the mix in any case :slight_smile:

  • major upgrades to libraries which are part of the extension interface (Jetty, Jackson, maybe others?) - such as PR #6077

The Butterfly upgrade includes:

  • Jetty 10 -> 12 (We only recently upgraded from Jetty 9 to Jetty 10 and haven't released that yet)
  • Java Servlet 4 -> Servlet 6 (most common uses are HttpServletRequest, HttpServletResponse, ServletException which go from javax.* to jakarta.* but rather than rote renaming, we should see how many references we can eliminate the need for)
  • Velocity 1.x -> 2.3
  • Java 8 -> Java 9 (we already require Java 11)
  • Apache Commons File Upload 1.5 -> 2.0
  • Apache commons-lang - removed (undeclared transitive dependency that OpenRefine was depending on)
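
To make the "eliminate the need for references" idea concrete, here is a minimal sketch of what a servlet-free command interface could look like. All the names (`CommandRequest`, `CommandResponse`, `ExtensionCommand`, `EchoCommand`) are hypothetical and not part of OpenRefine or Butterfly today. The assumption is that the core would provide the adapters around the servlet classes, so only the core would need to follow a javax.* to jakarta.* move.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, servlet-free view of a request: only what a command needs.
interface CommandRequest {
    String parameter(String name);
}

// Hypothetical, servlet-free view of a response.
interface CommandResponse {
    void respond(int status, String body);
}

// An extension command written against the neutral interfaces. It does not
// import javax.servlet or jakarta.servlet, so a namespace change in the
// servlet API would not require recompiling it.
interface ExtensionCommand {
    void handle(CommandRequest request, CommandResponse response);
}

class EchoCommand implements ExtensionCommand {
    @Override
    public void handle(CommandRequest request, CommandResponse response) {
        String text = request.parameter("text");
        if (text == null) {
            response.respond(400, "missing parameter: text");
        } else {
            response.respond(200, text);
        }
    }
}

// In-memory stand-ins used here for demonstration. In this sketch the core
// would instead supply adapters wrapping the real servlet request/response.
class MapRequest implements CommandRequest {
    private final Map<String, String> params;

    MapRequest(Map<String, String> params) {
        this.params = new HashMap<>(params);
    }

    @Override
    public String parameter(String name) {
        return params.get(name);
    }
}

class RecordingResponse implements CommandResponse {
    int status = -1;
    String body = null;

    @Override
    public void respond(int status, String body) {
        this.status = status;
        this.body = body;
    }
}

class ServletFreeDemo {
    public static void main(String[] args) {
        RecordingResponse response = new RecordingResponse();
        new EchoCommand().handle(new MapRequest(Map.of("text", "hello")), response);
        System.out.println(response.status + " " + response.body);
    }
}
```

A side benefit of this shape is that commands become testable without a servlet container, as the in-memory stand-ins show.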
  • removing Jackson from the extension interface by introducing other mechanisms for JSON serialization, as advocated for by @tfmorris (given Jackson's stability it's not clear to me that it's worth the breakage)

I think as a matter of good API hygiene we should try to exclude concrete classes and third party dependencies. The fewer requirements for shared common components we place on the extensions, the better, I think, but obviously some are necessary. For example, having them share a common logging infrastructure is a no-brainer.
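
As a rough illustration of keeping third-party types out of the extension interface, the sketch below has extensions serialize through a small core-owned interface rather than through Jackson annotations or classes. `JsonSink`, `JsonSerializable`, `StringJsonSink` and `ReconSettings` are made-up names for this example. The stdlib-only sink is there to keep the sketch self-contained; the core could just as well back the same interface with Jackson internally.

```java
// Hypothetical core-owned interface: no Jackson types appear in its signature.
interface JsonSink {
    JsonSink field(String name, String value);

    JsonSink field(String name, long value);
}

// What an extension class would implement instead of carrying Jackson annotations.
interface JsonSerializable {
    void writeTo(JsonSink sink);
}

// Minimal sink that writes a flat JSON object using only the standard library.
final class StringJsonSink implements JsonSink {
    private final StringBuilder out = new StringBuilder("{");
    private boolean first = true;

    private void key(String name) {
        if (!first) {
            out.append(',');
        }
        first = false;
        out.append(quote(name)).append(':');
    }

    @Override
    public JsonSink field(String name, String value) {
        key(name);
        out.append(quote(value));
        return this;
    }

    @Override
    public JsonSink field(String name, long value) {
        key(name);
        out.append(value);
        return this;
    }

    String finish() {
        return out + "}";
    }

    // Escapes only backslashes and double quotes; control characters are
    // deliberately left out of this sketch.
    private static String quote(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }
}

// Example extension-side class (hypothetical).
final class ReconSettings implements JsonSerializable {
    private final String endpoint;
    private final long batchSize;

    ReconSettings(String endpoint, long batchSize) {
        this.endpoint = endpoint;
        this.batchSize = batchSize;
    }

    @Override
    public void writeTo(JsonSink sink) {
        sink.field("endpoint", endpoint).field("batchSize", batchSize);
    }
}

class JsonSinkDemo {
    public static void main(String[] args) {
        StringJsonSink sink = new StringJsonSink();
        new ReconSettings("https://example.org/api", 10).writeTo(sink);
        System.out.println(sink.finish());
    }
}
```

The trade-off is the one raised above: a narrower interface means more work in the core and a migration for every extension, in exchange for being able to upgrade or swap the JSON library without touching extensions.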

  • replacing the registration of components contributed by extensions from controller.js to a declarative format (perhaps it can be done without breaking, if supporting both extension formats for a while is doable?)

This definitely seems like something that could be phased in with a deprecation period for the old style, if it makes sense.
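
A minimal sketch of how such a phase-in could work on the backend side, assuming a registry that accepts both styles and records a deprecation warning for the old one. `ComponentRegistry` and its methods are hypothetical names, and the declarative descriptor is represented here as an already-parsed map since no format has been chosen.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical registry; the names are illustrative, not OpenRefine's API.
class ComponentRegistry {
    private final Map<String, String> commands = new LinkedHashMap<>();
    private final List<String> deprecationWarnings = new ArrayList<>();

    // New style: the core reads a declarative descriptor (represented here as
    // a map of command name -> implementing class name) and registers it.
    void registerDeclared(String extension, Map<String, String> declared) {
        declared.forEach((name, className) -> commands.put(extension + "/" + name, className));
    }

    // Old style: an imperative call, as a controller.js script would make.
    // It still works during the deprecation period but leaves a warning.
    void registerLegacy(String extension, String name, String className) {
        deprecationWarnings.add(
                extension + ": imperative registration of '" + name + "' is deprecated");
        commands.put(extension + "/" + name, className);
    }

    Map<String, String> commands() {
        return commands;
    }

    List<String> deprecationWarnings() {
        return deprecationWarnings;
    }
}

class RegistrationDemo {
    public static void main(String[] args) {
        ComponentRegistry registry = new ComponentRegistry();
        registry.registerDeclared("new-ext", Map.of("fetch", "org.example.FetchCommand"));
        registry.registerLegacy("old-ext", "fetch", "org.example.legacy.FetchCommand");
        System.out.println(registry.commands().keySet());
        registry.deprecationWarnings().forEach(System.out::println);
    }
}
```

Both extensions end up registered the same way, so the rest of the application would not need to know which style was used; only the warning list distinguishes them.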

Over and above restoring the old behavior, we should consider what rules we want to put in place for extensions and how (or if?) we want to enforce them. Undeclared usage of transitive dependencies has historically been a significant source of breakage in extensions, so anything we can do to improve that situation would be helpful. Alternatively, we could go in the direction of only doing major version updates to dependencies when we do major OpenRefine releases, but that seems pretty restrictive.

  • change of namespace from com.google.refine to org.openrefine

Is it implied that this includes the Maven re-modularization? If not, that's another set of package name changes to be added to the list.

As long as we're shuffling things around, it would be worth considering what, if anything, we want to hide in internal implementation packages that aren't accessible to (or at least documented for) extensions.

  • immutable project data storage with lazily computed operations between history steps

Does this imply/bundle changes to the evaluation infrastructure which are commonly used by extensions? (Row/RecordVisitor, Operation, etc)

  • changes to the way frontend assets are bundled together: for instance, changes in how extensions are expected to initialize their contribution to the frontend (and un-initialize them too, perhaps?)
  • changes to operation registration, linking the backend and frontend components together (can probably be done in a backwards-compatible way without too much effort though)

Anything else? Anything you disagree with (such as things we shouldn't do at all?)

I think there was a grid cell renderer extension point introduced recently. Is that covered by the two items above?
The "Extensions" menu is a pretty limited (but safe) extension point. Are there additional front end extension points that could/should be contemplated?

Other items to consider:

  • Java 17 - historically we've been very conservative with bumping Java versions. Currently we require Java 11, but Jena 5 will require Java 17 and the Java ecosystem, in general, has been upgrading more quickly recently. Also, as an application rather than a library, there are fewer reasons for OpenRefine to be conservative in its usage of new Java features.

  • Jena 5 - I don't think we have a functional need for this, but the Jena project doesn't seem to do security updates for previous versions. Upgrading this has historically caused problems for the rdf-extension (perhaps a good opportunity to figure out long-term solutions to incompatible versions of common dependencies?)

  • Extension preferences - currently extensions are required to use Jackson and to embed a hardcoded class name in any custom preference data types. We should figure out a better scheme for this

  • REST API protocol review/update, versioning and documenting as a public API (could be broken into two or more items)

  • Incompatible evaluation results for GREL and/or GREL functions - are we committed to bug-for-bug compatibility? When / how are we allowed to change results of evaluations?

  • Crufty GREL function signatures - we have a number of functions which were extended over time in a backward compatible way, but the strictures of backward compatibility have given them funky definitions which could be cleaned up (e.g. Locale addition to various format/parse operations)

  • Operation history versioning & standardization - we've actively discouraged people from using this except in very limited contexts, but to what extent have they come to rely on it anyway and are going to be burned by any changes? What's their upgrade path? (Presumably some of this already changes with the project serialization format?)

There's probably more, but that seems like plenty to start with :grin:

Tom

To avoid hardcoding of CONFIG-like things in general...

For the extension preferences and even registration of components, I've always thought that configuration stuff like this should go into a TOML file where the general top structures are done by us and where extension developers can then also use lower nested levels additionally for storing any of their key/values. It benefits everyone with things like comments, literal strings, multi-line literal strings, etc. and as a bonus it maps to a hash table.

I have been thinking about this more and I think the Maven re-modularization can be introduced without breaking anything. We can move classes to different Maven modules without changing their package name. That means basically just changing the directory structure. Arguably it's cleaner to have the Maven module be reflected in the package name, but it's not necessary as a first step. Because the main module we currently have would depend on the new modules introduced, any existing Maven project (such as an extension developed outside of our code base) that currently depends on this main artifact would still pull in the dependencies transitively.

And it would already buy us a lot: the ability to re-use testing utilities (deleting the re-implementation we have in the Wikibase extension for instance) and avoiding dependency creep between different parts of the code base. So I am planning to submit a PR against master for that.

Does this imply/bundle changes to the evaluation infrastructure which are commonly used by extensions? (Row/RecordVisitor, Operation, etc)

Yes absolutely, it includes a change of interface for operations, since the way they access and modify project data needs to change.
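
To give a feel for why the operation interface has to change, here is a deliberately simplified sketch of the idea, not the actual design. `Grid` and `HistoryStep` are hypothetical names and rows are plain strings. The point is that an operation becomes a function from one immutable grid to another, and the result for a history step is only computed when someone asks for it.

```java
import java.util.List;
import java.util.function.UnaryOperator;
import java.util.stream.Collectors;

// Hypothetical immutable snapshot of the project rows at one history step.
final class Grid {
    private final List<String> rows;

    Grid(List<String> rows) {
        this.rows = List.copyOf(rows); // defensive, unmodifiable copy
    }

    List<String> rows() {
        return rows;
    }

    // Returns a new grid; this one is never modified.
    Grid mapRows(UnaryOperator<String> f) {
        return new Grid(rows.stream().map(f).collect(Collectors.toList()));
    }
}

// A history step whose grid is derived from its parent on first access only.
final class HistoryStep {
    private final HistoryStep parent;            // null for the initial import
    private final UnaryOperator<Grid> operation; // null for the initial import
    private Grid cached;                         // memoized result
    int evaluations = 0;                         // exposed for demonstration

    private HistoryStep(HistoryStep parent, UnaryOperator<Grid> operation, Grid cached) {
        this.parent = parent;
        this.operation = operation;
        this.cached = cached;
    }

    static HistoryStep initial(Grid grid) {
        return new HistoryStep(null, null, grid);
    }

    // Appending a step does not run the operation yet.
    HistoryStep then(UnaryOperator<Grid> op) {
        return new HistoryStep(this, op, null);
    }

    synchronized Grid grid() {
        if (cached == null) {
            evaluations++;
            cached = operation.apply(parent.grid());
        }
        return cached;
    }
}

class LazyHistoryDemo {
    public static void main(String[] args) {
        HistoryStep start = HistoryStep.initial(new Grid(List.of("a", "b")));
        HistoryStep upper = start.then(g -> g.mapRows(String::toUpperCase));
        System.out.println(upper.evaluations);   // nothing computed yet
        System.out.println(upper.grid().rows()); // computed here
        System.out.println(start.grid().rows()); // original is untouched
    }
}
```

An operation written in the current style, which mutates the project in place, has no equivalent in this model, which is why extensions defining operations would need to adapt.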

I think there was a grid cell renderer extension point introduced recently. Is that covered by the two items above?

I didn't think about including it because I don't see it as a breaking change (and it's released already). Or do you foresee changes to it?

Incompatible evaluation results for GREL and/or GREL functions - are we committed to bug for bug compatibility? When / how are we allowed to change results of evaluations?

That's a topic we've been battling with for a long time… I wonder if it would make sense to version GREL (or other expression languages) independently of OpenRefine? Not sure what it would look like.
Perhaps there is a way to take inspiration from spreadsheet tools there. Do LibreOffice or Excel ever change the implementation of the functions that are available in their formula language?
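
One possible shape for this, purely as a thought experiment: each function keeps its implementations keyed by the language version that introduced them, and an expression is evaluated against the version it was written for. `VersionedFunction` is a hypothetical name, and the "trim" behavior change below is invented for illustration; it does not describe how GREL's trim actually behaves.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.function.UnaryOperator;

// Hypothetical: one expression-language function with several implementations,
// keyed by the language version in which each was introduced.
class VersionedFunction {
    private final TreeMap<Integer, UnaryOperator<String>> byVersion = new TreeMap<>();

    VersionedFunction since(int version, UnaryOperator<String> implementation) {
        byVersion.put(version, implementation);
        return this;
    }

    // Picks the newest implementation that is not newer than the language
    // version the expression was written against.
    String call(int languageVersion, String argument) {
        Map.Entry<Integer, UnaryOperator<String>> entry = byVersion.floorEntry(languageVersion);
        if (entry == null) {
            throw new IllegalArgumentException(
                    "function not available in language version " + languageVersion);
        }
        return entry.getValue().apply(argument);
    }
}

class VersionedFunctionDemo {
    public static void main(String[] args) {
        // Invented example: version 1 only strips characters up to U+0020,
        // version 2 also strips other Unicode whitespace.
        VersionedFunction trim = new VersionedFunction()
                .since(1, String::trim)
                .since(2, String::strip);

        String input = "\u2003x\u2003"; // "x" surrounded by EM SPACE characters
        System.out.println(trim.call(1, input).length()); // old behavior preserved
        System.out.println(trim.call(2, input).length()); // new behavior
    }
}
```

Stored operation histories would then need to record the language version alongside each expression, which ties this to the operation history versioning question raised earlier.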

Here are some more topics that I forgot:

  • Removal of column groups (which is part of my big stack of changes)
  • Changing the definition of record boundaries (which I need to bring up in a different thread)

@abbe98 are there any changes you would like to put on the list, or weigh in about any of the ones already listed above? Otherwise I'd try to propose a schedule based on what we have here already.