Scheduling breaking changes we have on our radar

As a follow up to the discussion in this thread, let's start a discussion on how to schedule the various major changes we have on our radar.

I see this discussion not as a substitute for establishing a proper roadmap, but rather concentrated on changes which are "breaking" in the sense that they require adaptations for extensions. For instance, new features such as adding support for Wikidata lexemes wouldn't need to be scheduled here, because if we want to implement them it should be possible to do that at any stage, without breaking change. That being said, some new features can be made easier (or harder) by other changes, so if we are aware of such a dependency, it's something worth mentioning.

Also, the extent to which a change is breaking can be debated: there is often a choice of how much effort we want to put into making it compatible (such as by duplicating interfaces). I see it as a trade-off between our own development capacity and the capacity of extensions, reconciliation services, and end users to adapt. I'd say there is no absolute right or wrong here and I hope we can debate those things in a calm and supportive environment.

Would it make sense to first just list the changes we have on our minds?

The ones I am aware of off the top of my head are:

  • major upgrades to libraries which are part of the extension interface (Jetty, Jackson, maybe others?) - such as PR #6077
  • removing Jackson from the extension interface by introducing other mechanisms for JSON serialization, as advocated for by @tfmorris (given Jackson's stability it's not clear to me that it's worth the breakage)
  • replacing the registration of components contributed by extensions from controller.js to a declarative format (perhaps it can be done without breaking, if supporting both extension formats for a while is doable?)
  • changes to dependency isolation in Butterfly (classloader behavior)
  • change of namespace from com.google.refine to org.openrefine
  • immutable project data storage with lazily computed operations between history steps
  • changes to the way frontend assets are bundled together: for instance, changes in how extensions are expected to initialize their contribution to the frontend (and un-initialize them too, perhaps?)
  • changes to operation registration, linking the backend and frontend components together (can probably be done in a backwards-compatible way without too much effort though)

Anything else? Anything you disagree with (such as things we shouldn't do at all?)

@antonin_d Is getting Python 3 / Jython something on the radar? Eventually, correct? But perhaps that is 2 years away because of team capacity and this thread is about things within 1-2 years, yes/no? I guess we should rename this thread to a timeline instead of generic "radar"? We have lots on our radar/roadmap, and so quantifying that more specifically would likely help extension developers.

That's a good question! If we want to go for GraalPython, then we could move to requiring GraalVM as a JRE - that's arguably an important change that we could want to schedule here too.
If instead we go for py4j, perhaps that would have fewer implications in terms of breaking changes but more packaging work.

Thanks for adding it to the mix in any case :slight_smile:

  • major upgrades to libraries which are part of the extension interface (Jetty, Jackson, maybe others?) - such as PR #6077

The Butterfly upgrade includes:

  • Jetty 10 -> 12 (We only recently upgraded from Jetty 9 to Jetty 10 and haven't released that yet)
  • Java Servlet 4 -> Servlet 6 (most common uses are HttpServletRequest, HttpServletResponse, ServletException which go from javax.* to jakarta.* but rather than rote renaming, we should see how many references we can eliminate the need for)
  • Velocity 1.x -> 2.3
  • Java 8 -> Java 9 (we already require Java 11)
  • Apache Commons File Upload 1.5 -> 2.0
  • Apache commons-lang - removed (undeclared transitive dependency that OpenRefine was depending on)
  • removing Jackson from the extension interface by introducing other mechanisms for JSON serialization, as advocated for by @tfmorris (given Jackson's stability it's not clear to me that it's worth the breakage)

I think as a matter of good API hygiene we should try to exclude concrete classes and third party dependencies. The fewer requirements for shared common components we place on the extensions, the better, I think, but obviously some are necessary. For example, having them share a common logging infrastructure is a no-brainer.
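
To make that concrete, here is a minimal sketch of what an extension-facing serialization contract without third-party types could look like. All names here (`JsonSink`, `ExtensionSerializable`, `MapBackedSink`, `ReconSettings`) are hypothetical and not existing OpenRefine interfaces; the only assumption is that core stays free to implement the sink with Jackson or anything else behind the scenes.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical extension-facing contract: only JDK types appear in signatures.
interface JsonSink {
    void field(String name, String value);
    void field(String name, long value);
}

// Hypothetical interface an extension would implement instead of
// annotating its classes with a third-party library's annotations.
interface ExtensionSerializable {
    void writeTo(JsonSink sink);
}

// Toy core-side implementation backed by a map. A real one could
// delegate to Jackson internally without extensions ever seeing it.
final class MapBackedSink implements JsonSink {
    private final Map<String, Object> fields = new LinkedHashMap<>();

    @Override
    public void field(String name, String value) {
        fields.put(name, value);
    }

    @Override
    public void field(String name, long value) {
        fields.put(name, value);
    }

    Map<String, Object> fields() {
        return fields;
    }
}

// Example extension class: no third-party import is needed.
final class ReconSettings implements ExtensionSerializable {
    private final String endpoint;
    private final long timeoutMs;

    ReconSettings(String endpoint, long timeoutMs) {
        this.endpoint = endpoint;
        this.timeoutMs = timeoutMs;
    }

    @Override
    public void writeTo(JsonSink sink) {
        sink.field("endpoint", endpoint);
        sink.field("timeoutMs", timeoutMs);
    }
}
```

With a contract like this, swapping or upgrading the JSON library becomes an internal change, at the cost of extensions writing their serialization by hand.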

  • replacing the registration of components contributed by extensions from controller.js to a declarative format (perhaps it can be done without breaking, if supporting both extension formats for a while is doable?)

This definitely seems like something that could be phased in with a deprecation period for the old style, if it makes sense.
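
As a rough illustration of such a phase-in, the sketch below accepts both registration styles and records a warning for the old one. It is only a sketch under assumptions: `ComponentRegistry`, its method names, and the idea that a manifest is parsed elsewhere into a map of command names to class names are all hypothetical, not how Butterfly or OpenRefine currently work.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical registry that accepts both registration styles
// for the duration of a deprecation period.
final class ComponentRegistry {
    private final Map<String, String> commands = new LinkedHashMap<>();
    private final List<String> deprecationWarnings = new ArrayList<>();

    // New style: entries come from a declarative manifest, assumed to be
    // parsed elsewhere into a map of command name -> implementing class name.
    void registerFromManifest(String extension, Map<String, String> declaredCommands) {
        declaredCommands.forEach(
                (name, className) -> commands.put(extension + "/" + name, className));
    }

    // Old style: imperative call, such as one triggered from a controller.js script.
    // It still works, but records a warning so extension authors know to migrate.
    void registerLegacy(String extension, String name, String className) {
        deprecationWarnings.add(
                extension + ": imperative registration of '" + name + "' is deprecated");
        commands.put(extension + "/" + name, className);
    }

    // Returns the implementing class name, or null if nothing is registered.
    String lookup(String qualifiedName) {
        return commands.get(qualifiedName);
    }

    List<String> deprecationWarnings() {
        return deprecationWarnings;
    }
}
```

Both styles end up in the same table, so the rest of the application would not need to know which one an extension used; the collected warnings could be logged at startup.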

Over and above restoring the old behavior, we should consider what rules we want to put in place for extensions and how (or if?) we want to enforce them. Undeclared usage of transitive dependencies has historically been a significant source of breakage in extensions, so anything we can do to improve that situation would be helpful. Alternatively, we could go in the direction of only doing major version updates to dependencies when we do major OpenRefine releases, but that seems pretty restrictive.

  • change of namespace from com.google.refine to org.openrefine

Is it implied that this includes the Maven re-modularization? If not, that's another set of package name changes to be added to the list.

As long as we're shuffling things around, it would be worth considering what, if anything, we want to hide in internal implementation packages that aren't accessible to (or at least documented for) extensions.

  • immutable project data storage with lazily computed operations between history steps

Does this imply/bundle changes to the evaluation infrastructure which are commonly used by extensions? (Row/RecordVisitor, Operation, etc)

  • changes to the way frontend assets are bundled together: for instance, changes in how extensions are expected to initialize their contribution to the frontend (and un-initialize them too, perhaps?)
  • changes to operation registration, linking the backend and frontend components together (can probably be done in a backwards-compatible way without too much effort though)

Anything else? Anything you disagree with (such as things we shouldn't do at all?)

I think there was a grid cell renderer extension point introduced recently. Is that covered by the two items above?
The "Extensions" menu is a pretty limited (but safe) extension point. Are there additional front end extension points that could/should be contemplated?

Other items to consider:

  • Java 17 - historically we've been very conservative with bumping Java versions. Currently we require Java 11, but Jena 5 will require Java 17 and the Java ecosystem, in general, has been upgrading more quickly recently. Also, as an application rather than a library, there are fewer reasons for OpenRefine to be conservative in its usage of new Java features.

  • Jena 5 - I don't think we have a functional need for this, but the Jena project doesn't seem to do security updates for previous versions. Upgrading this has historically caused problems for the rdf-extension (perhaps a good opportunity to figure out long-term solutions to incompatible versions of common dependencies?)

  • Extension preferences - currently extensions are required to use Jackson and to embed a hardcoded class name in any custom preference data types. We should figure out a better scheme for this

  • REST API protocol review/update, versioning and documenting as a public API (could be broken into two or more items)

  • Incompatible evaluation results for GREL and/or GREL functions - are we committed to bug for bug compatibility? When / how are we allowed to change results of evaluations?

  • Crufty GREL function signatures - we have a number of functions which were extended over time in a backward compatible way, but the strictures of backward compatibility have given them funky definitions which could be cleaned up (e.g. Locale addition to various format/parse operations)

  • Operation history versioning & standardization - we've actively discouraged people from using this except in very limited contexts, but to what extent have they come to rely on it anyway and are going to be burned by any changes? What's their upgrade path? (Presumably some of this already changes with the project serialization format?)
    There's probably more, but that seems like plenty to start with :grin:

Tom

To avoid hardcoding of CONFIG-like things in general...

For the extension preferences and even registration of components, I've always thought that configuration stuff like this should go into a TOML file where the general top structures are done by us and where extension developers can then also use lower nested levels additionally for storing any of their key/values. It benefits everyone with things like comments, literal strings, multi-line literal strings, etc. and as a bonus it maps to a hash table.

I have been thinking about this more and I think the Maven re-modularization can be introduced without breaking anything. We can move classes to different Maven modules without changing their package name. That means basically just changing the directory structure. Arguably it's cleaner to have the Maven module be reflected in the package name, but it's not necessary as a first step. Because the main module we currently have would depend on the new modules introduced, any existing Maven project (such as an extension developed outside of our code base) that currently depends on this main artifact would still pull in the dependencies transitively.

And it would already buy us a lot: the ability to re-use testing utilities (deleting the re-implementation we have in the Wikibase extension for instance) and avoiding dependency creep between different parts of the code base. So I am planning to submit a PR against master for that.

Does this imply/bundle changes to the evaluation infrastructure which are commonly used by extensions? (Row/RecordVisitor, Operation, etc)

Yes absolutely, it includes a change of interface for operations, since the way they access and modify project data needs to change.

I think there was a grid cell renderer extension point introduced recently. Is that covered by the two items above?

I didn't think about including it because I don't see it as a breaking change (and it's released already). Or do you foresee changes to it?

Incompatible evaluation results for GREL and/or GREL functions - are we committed to bug for bug compatibility? When / how are we allowed to change results of evaluations?

That's a topic we've been battling with for a long time… I wonder if it would make sense to version GREL (or other expression languages) independently of OpenRefine? Not sure what it would look like.
Perhaps there is a way to take inspiration from spreadsheet tools there. Does LibreOffice or Excel ever change the implementation of the functions that are available in their formula language?
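
To give an idea of what that might look like, here is a minimal sketch where each function keeps several implementations keyed by the language version that introduced them. The class `VersionedFunction` and the notion of an integer "language version" stored with each expression are assumptions made up for this illustration; nothing like this exists in OpenRefine today.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;

// Hypothetical holder for several implementations of one function,
// keyed by the expression-language version that introduced each behavior.
final class VersionedFunction {
    private final TreeMap<Integer, Function<String, String>> byVersion = new TreeMap<>();

    VersionedFunction since(int languageVersion, Function<String, String> implementation) {
        byVersion.put(languageVersion, implementation);
        return this;
    }

    // Picks the newest implementation that is not newer than the version the
    // expression was written against, so old expressions keep their old results.
    String call(int languageVersion, String argument) {
        Map.Entry<Integer, Function<String, String>> entry =
                byVersion.floorEntry(languageVersion);
        if (entry == null) {
            throw new IllegalArgumentException(
                    "function not available in language version " + languageVersion);
        }
        return entry.getValue().apply(argument);
    }
}
```

A bug fix would then be registered as a new version (for example `since(2, ...)`) rather than replacing the old behavior, and an operation history that recorded version 1 would keep evaluating as before. The obvious cost is that every old implementation has to be maintained for as long as its version is supported.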

Here are some more topics that I forgot:

  • Removal of column groups (which is part of my big stack of changes)
  • Changing the definition of record boundaries (which I need to bring up in a different thread)

@abbe98 are there any changes you would like to put on the list, or weigh in about any of the ones already listed above? Otherwise I'd try to propose a schedule based on what we have here already.

I'm wondering how to make progress on this.

I have to say those scheduling decisions do depend on deciding on a longer-term roadmap, because some of the features that would be listed there would require breaking changes, of course.

As I have just posted in the thread about extension support, I see the state of affairs there as very dire, so I'd be tempted to argue that for now we should only do breaking changes that are meant to improve extension stability (apart from vulnerability fixing, say), and then once that is stable we could work on new features again. Of course that would completely stall the reproducibility project, which is not exactly ideal to me…

Honestly our situation feels so bad that I am not sure if incremental improvements, even if they are carefully planned, are desirable at all. I have started to think that rewriting OpenRefine from scratch in a different stack is not that crazy of an idea. Over the past years, whenever people would encourage me to rewrite OpenRefine in a different language / framework (which happens regularly), I would politely smile and decline, saying that the satisfaction of maintaining existing software that is useful to many is more important to me than the satisfaction of working with the latest fanciest framework. Well, this problem is really making me reconsider that response.

Sure, rewriting OpenRefine is a lot more work than what is needed to patch the extension system. But if we do that work, we also have the opportunity to improve on a lot of other things:

  • migrate out of Python 2.7 to something a bit more recent? There are prospects to do this while remaining with a Java backend (GraalPython, py4j) but there's still quite a lot of uncertainty around that in my opinion
  • adopt a stack that is approachable and attractive to more developers, to make the project more sustainable? I think Java and jQuery put off a lot of people.
  • by rewriting the UI in a modern stack, we'd likely use an existing widget library, which would give a fresh coat of paint on the UI in the same go
  • although I have been trying to break down my new architecture into reviewable chunks, I still don't know how to split the central piece at all, so a rewrite from the ground up would also be an occasion to introduce this sort of architectural change.
  • we have a pretty decent end-to-end test suite that we might be able to keep and adapt if we keep an eye on CSS selectors (assuming the UI is still web based, which I think I would support)

Maybe it's just me being too depressed about the state of things and I just need to take proper holidays (which I am actually counting on doing, yay!) but I feel like these sorts of thoughts are worth voicing.

:grin: What stack or kinds of stacks were you thinking might make us more approachable for more developers? No winners or losers. But I think we can talk openly about this and listen to everyone about some pros/cons for attracting contributors. So I definitely would like to hear what you're thinking (then you can take a vacation and think some more). :grin:

But I can start.
For frontend and backend, I'll reserve my opinion until later, but here's my take...

Stack Architecture suggestion

Frontend:

  • Progressive JS framework (we can rewrite & adopt it slowly)
  • Components should be easily extendable
  • Does not use a virtual dom, fast
  • Must be dev friendly, concise, and easy to use
  • Well documented in English

Backend:

  • Java interoperability (so we can rewrite & adopt slowly)
  • type inference
  • null safety
  • Extension mechanisms built-in like extending classes without inheritance
  • OOP but supports functional patterns
  • functions as first-class citizens
  • data classes (no constructors/getters/setters needed!)
  • easily access values on any object (supports destructuring)

Extensions

There's also some additional goodness we would probably like to have when it comes to extending GREL itself. These tidbits come from Stefano on the old mailing list; in essence, they are about ensuring GREL stays useful:

My opinion is that GREL should be a streamlined condensation of the tools and operations that people find they need the most when working with data in their real-life workflows. The benefit of that is that we can make something very compact like

value.parseHTML().find(".whatever").split("|")[1]

once we know what we need.

By no means people should stop experimenting with Jython and Clojure support in Refine, that's why we have support for additional languages that already come with extensive APIs.

My suggestion for Acre's HTML parsing machinery integration is merely that we already have it and it's working well for us in many apps we use for Freebase so it might be useful here too.

(He was talking about Acre's code, which includes an HTML scanner/parser engine and lots of other good open source code in its DOM folder: acre/webapp/WEB-INF/src/com/google/acre/DOM/AcreHTMLScanner.java at master · googlearchive/acre · GitHub)

Regarding Extensions and perhaps referencing each other, Stefano at one time said this:

yes, extensions can reference each other using this syntax

<link href="[#core#]/style.css" type="text/css" rel="stylesheet" />

For more info, read

http://code.google.com/p/simile-butterfly/wiki/UserGuide

We didn't envision extensions needing to do too much, but eventually it became apparent we needed to do more. In the early days of Gridworks, David designed the ClientSideResourceManager, and initially Butterfly and our code didn't have the ability to inject client-side scripts inside the core templates. But eventually we added that to help extension developers.

Alternative 1

  • Drop extension support
  • Move to a Core Only approach
  • Embrace Components framework
  • Shun a ServiceLoader approach, hmm, or maybe not?
  • Ask or help extension developers to move into Core
  • Have extensions pushed out to Maven Central and pull them in as core dependency
  • If non-compatible & fails testing - we fix, get them to fix, get whole community to fix.
  • Extension developers actually begin to appreciate us. They are in Core!
  • Users have confidence the extensions work in the current version and tested.
  • Enabling and Disabling extensions through Preferences, whatever.
  • Discovery of extensions is easy. Just look in OpenRefine at Extensions page!
  • Extensions live... but just don't run when disabled.
  • Require restart for enable/disable (otherwise, things just go BOOM anyways)
  • Our pool of contributors and developers for the Core Team expands, because suddenly extension developers become part of the Core Team! Or stated differently, they "become direct contributors"

Cons for Alternative 1

  • Startup time gets longer. But seriously, who really cares about 15-30 seconds to start?
  • But startup time can drastically improve if we native build with GraalVM.
  • Python extension stays at 2.7 for time being

Alternative 2

Alternative 3

  • REST API protocol review/update, versioning and documenting as a public API (could be broken into two or more items)

I would suggest adopting the OpenAPI Specification. It standardizes API development and gives API consumers a fast track: they no longer need to write an API client themselves, since OpenAPI client generators can produce one with a single command. The server generators also support Java and could be used to generate server-side code. The spec also brings the option of a Swagger UI and documentation generators.

Replying in reverse order to DaxServer, Thad, and Antonin:

  1. Using OpenAPI/Swagger to document the API and generate clients is definitely the way to go, but I'm assuming this would be more than just a straight documentation of the existing API, but also a cleanup with consistent return formats, etc.
  2. Thad's quotes from Stefano are from:
  3. I don't see a rewrite such as proposed by Antonin as something which is feasible. Big bang rewrites always seem more attractive and easier than the messy alternative of dealing with existing code, but they always take longer than predicted and often fail altogether. Joel summarizes my view on this pretty well.

    Focusing on extension stability first makes sense to me, but I don't think the current situation is as dire as Antonin feels it is. I think it's mostly a matter of being a little bit more disciplined about evolving our APIs and adding tooling support to help us know when we're breaking an API unintentionally.
    Tom

I think it's really great to see that you are still motivated to maintain this code base as it stands, @tfmorris. I am definitely not going to try to change your mind on this and will happily leave you to it :slight_smile: My message above was more meant to convey my general tiredness of working on this, which is more personal.