4.0 architecture and future JDK 20+ compatibility

@antonin_d I just noticed that David Delabassee posted the latest JDK updates to the Jython mailing list.
Of particular importance are the upcoming changes to Locale handling and the upgrade to Unicode CLDR Version 42.

This may be something to track or think about for the 4.0 architecture, now or later.

Heads-Up - JDK 20 - Support for Unicode CLDR Version 42

The JDK’s locale data is based on the Unicode Consortium’s Unicode
Common Locale Data Repository (CLDR). As mentioned in the December 2022
Quality Outreach newsletter [1], JDK 20 upgraded CLDR [2] to version 42
[3], which was released in October 2022. This version includes a “more
sophisticated handling of spaces” [4] that replaces regular spaces with
non-breaking spaces (NBSP / \u00A0) or narrow non-breaking spaces
(NNBSP / \u202F):

  • in time formats between a and time
  • in unit formats between {0} and unit
  • in Cyrillic date formats before year marker such as г

Other noticeable changes include:

  • " at " is no longer used for standard date/time format [5]
  • fix first day of week info for China (CN) [6]
  • Japanese: Support numbers up to 9999京 [7]

As a consequence, production and test code that produces or parses
locale-dependent strings like formatted dates and times may change
behavior in potentially breaking ways (e.g. when a handcrafted datetime
string with a regular space is parsed, but the parser now expects an
NBSP or NNBSP). Issues can be hard to analyze because expected and
actual strings look very similar or even identical in various text
representations. To detect and fix these issues, make sure to use a text
editor that displays different kinds of spaces differently.
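
When a suitable editor is not at hand, the same inspection can be done in code. The following is a minimal sketch, not part of the original heads-up: `SpaceInspector` and `reveal` are hypothetical names, and the helper simply escapes every space separator other than the plain ASCII space so that NBSP and NNBSP become visible in logs and test failure messages.

```java
final class SpaceInspector {
    private SpaceInspector() {}

    /**
     * Returns a copy of the text in which every space separator other than
     * the plain ASCII space is replaced by a visible escape of its code unit.
     */
    static String reveal(String text) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            // Character.isSpaceChar covers the Unicode space separators,
            // including NBSP (U+00A0) and NNBSP (U+202F).
            if (c != ' ' && Character.isSpaceChar(c)) {
                out.append(String.format("\\u%04X", (int) c));
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // A handcrafted string with a regular space stays unchanged,
        // while the NNBSP variant shows up as an escape sequence.
        System.out.println(reveal("10:15 PM"));
        System.out.println(reveal("10:15\u202FPM"));
    }
}
```

Running expected and actual strings through such a helper before comparing or logging them makes the otherwise invisible difference show up in the output.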

If the required fixes can’t be implemented when upgrading to JDK 20,
consider using the JVM argument -Djava.locale.providers=COMPAT to use
legacy locale data. Note that this limits some locale-related
functionality, so treat it as a temporary workaround, not a proper
solution. Moreover, the COMPAT option will eventually be removed.

It is also important to keep in mind that this kind of locale data
evolves regularly so programs parsing/composing the locale data by
themselves should be routinely checked with each JDK release.
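
A routine check of this kind could look like the sketch below. It is an illustration under stated assumptions rather than anything from the heads-up: `LocaleSpaceCheck` and its methods are hypothetical names, and `Locale.US` with a short time style is just one convenient probe. It contrasts a locale-driven format, whose separator comes from the JDK's locale data, with an explicit pattern, whose literal space is fixed by the program, and shows a small normalizer for code that must compare against handcrafted strings.

```java
import java.time.LocalTime;
import java.time.format.DateTimeFormatter;
import java.time.format.FormatStyle;
import java.util.Locale;

final class LocaleSpaceCheck {
    private LocaleSpaceCheck() {}

    /** Formats with the locale's own short time style; the separator depends on the JDK's locale data. */
    static String localizedShortTime(LocalTime time, Locale locale) {
        return DateTimeFormatter.ofLocalizedTime(FormatStyle.SHORT).withLocale(locale).format(time);
    }

    /** Formats with an explicit pattern; the literal space in the pattern is always a regular space. */
    static String fixedPatternTime(LocalTime time) {
        return DateTimeFormatter.ofPattern("h:mm a", Locale.US).format(time);
    }

    /** Maps NBSP and NNBSP back to a regular space, e.g. before comparing with handcrafted strings. */
    static String normalizeSpaces(String text) {
        return text.replace('\u00A0', ' ').replace('\u202F', ' ');
    }

    public static void main(String[] args) {
        LocalTime time = LocalTime.of(22, 15);
        String localized = localizedShortTime(time, Locale.US);
        String fixed = fixedPatternTime(time);
        // Whether these two strings are equal depends on the JDK version and locale provider in use.
        System.out.println("localized equals fixed pattern: " + localized.equals(fixed));
        System.out.println("equal after normalizing spaces: " + normalizeSpaces(localized).equals(fixed));
    }
}
```

Running this on each new JDK in CI is one way to notice when the locale data changes under code that builds or parses such strings by hand.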

[1] https://mail.openjdk.org/pipermail/quality-discuss/2022-December/001100.html
[2] [JDK-8284840] Update CLDR to Version 42.0 - Java Bug System
[3] Unicode CLDR - CLDR 42 Release Note
[4] [CLDR-14032] - Unicode Consortium
[5] [CLDR-14831] - Unicode Consortium
[6] [CLDR-11510] - Unicode Consortium
[7] [CLDR-15966] - Unicode Consortium

Hi @thadguidry, at first glance I do not think those changes should interfere with OpenRefine 3 or 4. I guess we can start thinking about adding JDK 20 to the CI and see how our test suite runs there.

FYI, Java 22 is GA, and JDK 23 EA is available. That means we can start testing against Java 23 to see how we might fare against this:

Heads-up: JDK 20-23: Support for Unicode CLDR Version 42

The JDK update to CLDR version 42 included a change where regular spaces in date/time formats (and some other formatted values) were replaced with (narrow) non-breaking spaces. This led to issues for existing code that relied on parsing such strings. To address that, JDK 23 allows loose matching of spaces when parsing date/time strings. Loose matching is performed in the lenient parsing style for both date/time parsers in java.time.format and java.text packages. In the default strict parsing style, those spaces are considered distinct as before.

Please read this updated heads-up [9] for details on how to configure strict/lenient parsing in the java.time.format (strict by default) and java.text (lenient by default) packages.
[9] Quality Outreach Heads-up - JDK 20-23: Support for Unicode CLDR Version 42 – Inside.java
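
For java.time.format, the strict/lenient switch can be sketched as follows. This is a minimal illustration, not taken from the linked heads-up: `LenientTimeParsing`, `STRICT`, `LENIENT`, and `tryParse` are hypothetical names, and the `h:mm a` pattern with `Locale.US` is an assumption chosen for the example. Per the heads-up, the lenient formatter should accept an NNBSP in place of the pattern's space on JDK 23 and later; on earlier JDKs it is expected to reject it, so the sketch reports the result instead of assuming it.

```java
import java.time.LocalTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeFormatterBuilder;
import java.time.format.DateTimeParseException;
import java.util.Locale;
import java.util.Optional;

final class LenientTimeParsing {
    private LenientTimeParsing() {}

    /** Strict parsing is the default for java.time.format formatters. */
    static final DateTimeFormatter STRICT =
            DateTimeFormatter.ofPattern("h:mm a", Locale.US);

    /** parseLenient() switches the pattern elements appended after it to lenient parsing. */
    static final DateTimeFormatter LENIENT = new DateTimeFormatterBuilder()
            .parseLenient()
            .appendPattern("h:mm a")
            .toFormatter(Locale.US);

    /** Returns the parsed time, or an empty Optional if the text is rejected. */
    static Optional<LocalTime> tryParse(DateTimeFormatter formatter, String text) {
        try {
            return Optional.of(LocalTime.parse(text, formatter));
        } catch (DateTimeParseException e) {
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        String withNnbsp = "10:15\u202FPM";
        System.out.println("strict, regular space: " + tryParse(STRICT, "10:15 PM"));
        System.out.println("strict, NNBSP:         " + tryParse(STRICT, withNnbsp));
        // The outcome of this line depends on the JDK version running the code.
        System.out.println("lenient, NNBSP:        " + tryParse(LENIENT, withNnbsp));
    }
}
```

For java.text, the corresponding switch is `DateFormat.setLenient(boolean)`, which `SimpleDateFormat` inherits.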

@tfmorris or @antonin_d Do you have any concerns about how this affects our roadmap?

I'm happy to add newer versions of Java to the CI whenever they are available. I generally don't expect it to bring in any breaking changes so I don't think it needs any particular scheduling, but if breakages do appear, we can assess how to deal with them accordingly.

Thanks for the heads up. Since we basically pass this functionality through transparently, it's definitely something that we should pay attention to and include in the release notes, because it will change the output of the date formatting operations. I don't think it's a huge concern, though, particularly since they're updating the parsing side of things to accommodate it.

Tom
