Faster? Clustering

One of the things I’ve been working on is revisiting some old problem areas in clustering performance and using AI to see if it might help in some capacity. I knew there were areas in Simile-Vicino that caused excessive thread delay with certain algorithms, and I wanted to find out why and perhaps even squeeze out some more optimization. My hunch seems to have been fruitful. I found a few areas in SecondString (a Vicino dependency) that could be improved, and applied a few Vector API optimizations as well where they could benefit. There might be other areas and dependencies in Simile-Vicino, and even in OpenRefine’s own code, that I have not looked at yet, but I’ll save that for another day when I have some more free time. Clustering performance scales with the size of the strings, and thus the number of tokens needing to be parsed/binned/indexed, etc.

I’m not quite sure how to contribute this. My hunch is that I can probably just create a dedicated repo for SecondString-fast for build & review? @tfmorris any suggestions here?

Here’s the JMH performance benchmarks against a few known large test datasets that are using a new optimized build of SecondString-fast.jar.

```
WARNING: Using incubator modules: jdk.incubator.vector

=== Dataset: acm_large ===
rows=2294 avg_len=143.9 max_len=392
JaroWinkler n=140 orig=1361ms fast=156ms speedup=8.72x
SoftTFIDF n=90 orig=891ms fast=509ms speedup=1.75x
Dictionary n=2294 q=100 orig=8111ms fast=3343ms speedup=2.43x

=== Dataset: dblp_large ===
rows=2616 avg_len=121.8 max_len=398
JaroWinkler n=140 orig=932ms fast=113ms speedup=8.25x
SoftTFIDF n=90 orig=663ms fast=246ms speedup=2.70x
Dictionary n=2616 q=100 orig=6822ms fast=3548ms speedup=1.92x

=== Dataset: itunes_amazon_tableB_large ===
rows=55923 avg_len=142.1 max_len=550
JaroWinkler n=140 orig=2075ms fast=155ms speedup=13.39x
SoftTFIDF n=90 orig=825ms fast=447ms speedup=1.85x
Dictionary n=12000 q=100 orig=52307ms fast=14837ms speedup=3.53x
```

Cool. It's been a while since I looked at the clustering, but my performance thoughts were captured here:
https://github.com/OpenRefine/OpenRefine/issues/3618

> I’m not quite sure how to contribute this. My hunch is that I can probably just create a dedicated repo for SecondString-fast for build & review? @tfmorris any suggestions here?

I'm not sure it's actively maintained, but I'd try contributing pull requests upstream:
https://github.com/TeamCohen/secondstring

I suspect they'd appreciate the benchmark (if they don't already have one) and the performance optimizations as separate pull requests.

To be honest though, I'm not sure how critical performance was/is to them since it came out of academia. That package was written in 2003, so there are likely better alternatives these days.

None of the algorithms that you list are used by OpenRefine (as far as I know). Do your performance speedups also affect things like Levenshtein?

Tom

I’m looking into the other algorithms this week.

@tfmorris Can you confirm that the only kNN distances that OpenRefine registers are for levenshtein (which actually lives in the secondstring repo) and ppm (which actually lives in the simile-vicino repo)?

So far, for levenshtein I’ve been able to produce about an 8x speed improvement.

https://github.com/thadguidry/secondstring-fast

ppm is a bit harder (it uses compression) and I need to give more hints and guidance to the agent. It might not produce as significant gains, but I’ll see (it might depend on Java’s intrinsic methods, but I’ll make sure there’s a proper automatic fallback if some are not supported on a given CPU; mine supports them).

Oh also, which simile-vicino repo should I fork from?

Those are the only two (besides user defined distances using expressions). They are registered here:
https://github.com/OpenRefine/OpenRefine/blob/7ffb1b3391773a4a688cdf4ca4957892d01af377/main/webapp/modules/core/MOD-INF/controller.js#L357-L358
and the repo is here: https://github.com/OpenRefine/simile-vicino

BUT, unless you're just doing this for fun, before you invest too much time, I'd suggest benchmarking the Apache implementation to see if it's worth trying to optimize SecondString. It has a nice variant that allows you to pass in the max edit distance so that it can exit early when it exceeds the max, rather than computing the edit distance for the entire string, only to prune it later.

Tom
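For reference, the early-exit idea described above can be sketched in plain Java like this. This is a minimal illustration of a bounded Levenshtein, not the actual Apache Commons Text code (Apache's `LevenshteinDistance` accepts a threshold in its constructor and returns -1 when the distance exceeds it):

```java
// Minimal sketch of a threshold-bounded Levenshtein distance: bail out
// as soon as the distance must exceed `max`, instead of computing the
// full edit distance only to prune it later. Not the Apache Commons
// Text implementation, just an illustration of why the bounded variant
// helps when most candidate pairs will be pruned anyway.
public class BoundedLevenshtein {

    /** Returns the edit distance between a and b, or -1 if it exceeds max. */
    public static int distance(CharSequence a, CharSequence b, int max) {
        int n = a.length(), m = b.length();
        if (Math.abs(n - m) > max) return -1; // length gap alone exceeds max
        int[] prev = new int[m + 1];
        int[] curr = new int[m + 1];
        for (int j = 0; j <= m; j++) prev[j] = j;
        for (int i = 1; i <= n; i++) {
            curr[0] = i;
            int rowMin = curr[0];
            for (int j = 1; j <= m; j++) {
                int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
                rowMin = Math.min(rowMin, curr[j]);
            }
            // DP values never drop below the minimum of the previous row,
            // so once a whole row is above max we can stop early.
            if (rowMin > max) return -1;
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[m] <= max ? prev[m] : -1;
    }
}
```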


Yeap, my focus was indeed to benchmark the Apache implementation and compare. I’ll let you know!

@tfmorris DONE. I’ve added the Apache Commons Text Similarity to the benchmarks for comparison, with notes in the README.md.

My SecondString-fast is a bit faster than Apache Similarity, but that’s to be expected because I haven’t contributed to Apache Similarity any kind of Vector API optimizations which do help quite a bit for parallel string comparison operations.

Since it’s only a small improvement over Apache and not huge (1.17x), I’d likely just pull the Apache implementation into OpenRefine instead of SecondString-fast, and then later contribute similar optimizations to Apache Similarity when time permits. And as you said, there’s a nice variant that lets you pass in the max edit distance.

Take a look at the benchmarks in README or run them yourself.
https://github.com/thadguidry/secondstring-fast

```
=== Dataset: itunes_amazon_tableB_large ===
rows=55923 avg_len=142.1 max_len=550
Levenstein n=140 orig=3437ms fast=438ms commons=512ms fast/orig=7.85x commons/orig=6.71x commons/fast=1.17x
```

Thanks for reporting back (and doing the work in the first place).


> Since it’s a small improvement over Apache and not huge (1.17x), I’d likely just pull in the Apache implementation into OpenRefine instead of SecondString-fast

More importantly Apache Commons is 6.5-8.5x the speed of our current implementation, even without the early bailout, so definitely a change worth doing.

I'm also interested in finding a quicker (and better supported) algorithm than our current arithmetic coder with the PPM model.

Lastly, we still have the PassJoin clusterer to integrate which showed a lot of promise.

Tom

No problem. It’s just a few tokens :slight_smile:
So this week and next, I’ll work towards the following PR and benchmarks:

  1. PR to integrate Apache Commons Text Similarity (released version)
  2. Find and show benchmarks for a quicker algorithm than our current PPM arithmetic coder.
  3. (OPTIONAL) Integrate the PassJoin cluster as a new algorithm for Clustering

Does that sound good? Is 3 the right approach, as a “new” algorithm choice in the Clustering interface?

FYI, regarding Apache Commons Text Levenshtein: it looks like another ~21% gain could be had further down the line. Here’s Claude Opus 4.6 showing that smart use of the Vector API in places can help parallelize string comparisons in the library. (Claude took about 30 minutes to figure it all out and ensure the tests still passed after refactoring with the Vector API.)

Claude Opus 4.6 analysis of Apache Commons Text

LevenshteinDetailedDistance.java

The results are in. Here's the full picture with the acm_large.txt long-string dataset:

| Benchmark | Score (ms/op) | ±error | vs. Unlimited baseline |
| --- | --- | --- | --- |
| Unlimited — Boxed SimilarityInput (baseline) | 40.071 | ±1.109 | — |
| Unlimited — CharSequence fast path | 41.986 | ±0.851 | ~same |
| Unlimited — Vector API | 31.695 | ±1.560 | ~21% faster |
| Threshold(7) — Boxed baseline | 10.418 | ±0.448 | — |
| Threshold(7) — CharSequence fast path | 10.408 | ±0.449 | ~same |

The Vector API implementation is ~21% faster than the scalar baseline on this realistic long-string workload.

Where the speedup comes from
Vectorized cost computation — Comparing 8-16 characters at once against the broadcast rightJ using ShortVector.eq(), which maps to PCMPEQW (SSE2) or wider AVX instructions.

Vectorized vertical/diagonal min — IntVector.min() computes min(p[i]+1, p[i-1]+cost[i]) across 4-8 lanes simultaneously using PMINSD (SSE4.1).

Scalar prefix fixup remains — The d[i-1]+1 dependency is inherently sequential, but computing the lower bound vectorially means the fixup pass often finds candidateVD[i] <= d[i-1]+1 and doesn't propagate further.

Why this can't go into the library source today:

The project targets Java 8 (maven.compiler.source=1.8), and jdk.incubator.vector requires Java 16+. If the project ever moves to Java 21+ (or adopts multi-release JARs), the vectorized implementation could be added as a runtime-detected fast path. The benchmark class at LevenshteinVectorized.java serves as a proof-of-concept for that future path.
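The "runtime-detected fast path" mentioned above could take a shape like the following. This is a hypothetical pattern, not code from Apache Commons Text or SecondString; `LevenshteinDispatch` and the commented-out vectorized call are invented names. The idea is to probe once for the Vector API and always keep a correct scalar fallback:

```java
// Hypothetical shape of a runtime-detected fast path. The vectorized
// class would live in a separate (e.g. multi-release JAR) artifact;
// here we only probe whether jdk.incubator.vector is resolvable, which
// requires Java 16+ started with --add-modules jdk.incubator.vector.
public final class LevenshteinDispatch {

    private static final boolean VECTOR_API_AVAILABLE = probeVectorApi();

    private static boolean probeVectorApi() {
        try {
            Class.forName("jdk.incubator.vector.ShortVector");
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false; // module absent or not enabled: use scalar path
        }
    }

    public static int distance(CharSequence a, CharSequence b) {
        if (VECTOR_API_AVAILABLE) {
            // return LevenshteinVectorized.distance(a, b); // hypothetical vector path
        }
        return scalarDistance(a, b); // always-correct fallback
    }

    // Plain two-row dynamic-programming Levenshtein as the fallback.
    static int scalarDistance(CharSequence a, CharSequence b) {
        int m = b.length();
        int[] prev = new int[m + 1];
        int[] curr = new int[m + 1];
        for (int j = 0; j <= m; j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= m; j++) {
                int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] t = prev; prev = curr; curr = t;
        }
        return prev[m];
    }
}
```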

I had a nice chat with Gemini the other day about string similarity algorithm optimization strategies, and it knocked out 10x super-optimized, zero-copy implementations of Levenshtein, Zstd NCD, Jaro-Winkler, and Dice-Sørensen in no time. But then I went to look at where they'd get integrated and realized that it would make a limited difference (i.e., I had asked Gemini to solve the wrong problem).

While our string similarity implementations aren't super fast, the real killer is the internal data organization and the way it is accessed. The data is stored in rows and all access is mediated through (multiple) Visitor patterns. This is very flexible, but also very slow. All modern systems of this type store data in columns so that they are cache friendly and can be easily vectorized.

All of these optimized string similarity algorithms want to be fed a vector of strings. The way the OpenRefine clusterer works is to visit each row one at a time, evaluate the active facets for that row one at a time, and then, if all the facets evaluate to true, look up the string in the appropriate column and add it to the clusterer, then move on to the next row in the project and do it all over again. If the user changes any of the cluster parameters, it starts over and does it all again from scratch.

We'd likely get a much bigger payback from reorganizing the dataflow than from optimizing the individual algorithms (although obviously both help). What we need to measure this is a benchmark at the level of the clustering operation, rather than at the level of the string similarity function.

I've thought about switching to column stores several times over the years and have always been put off by the size of the task, but we might be able to optimize the clustering specifically, by doing things like prefetching and caching the selected row indexes and strings, then passing them to the string similarity algorithms as a vector and preserving the cache if all the user has changed is the algorithm or one of its parameters and not the underlying filters. I haven't looked at it in detail, but my gut feel is that it could offer pretty big savings for clustering operations.
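The prefetch-and-cache idea above might look roughly like this. All the names here (`ClusterInputCache`, the facet key, the row accessors) are invented for illustration and do not match OpenRefine's actual classes:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntFunction;
import java.util.function.IntPredicate;

// Hypothetical sketch of caching the filtered column slice: materialize
// the selected rows' strings once as a dense String[] keyed by the facet
// state, then reuse it when only the algorithm or its parameters change,
// so the similarity code is fed a flat vector of strings instead of
// re-visiting every row through the Visitor machinery.
public class ClusterInputCache {

    private String facetKey;       // identifies the current filter state
    private String[] cachedValues; // dense, cache-friendly column slice

    /**
     * rowMatches stands in for the combined facet predicate and
     * columnValue for cell lookup. Recomputes only when facetKey changes.
     */
    public String[] values(String facetKey, int rowCount,
                           IntPredicate rowMatches,
                           IntFunction<String> columnValue) {
        if (cachedValues != null && facetKey.equals(this.facetKey)) {
            return cachedValues; // filters unchanged: reuse across algorithm changes
        }
        List<String> out = new ArrayList<>();
        for (int row = 0; row < rowCount; row++) { // single pass over the rows
            if (rowMatches.test(row)) {
                String v = columnValue.apply(row);
                if (v != null) out.add(v);
            }
        }
        this.facetKey = facetKey;
        this.cachedValues = out.toArray(new String[0]);
        return cachedValues;
    }
}
```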

From a UX point of view, in addition to making things faster, making the clustering cancellable and, perhaps trying multiple algorithms hierarchically (e.g. Levenshtein/Jaro-Winkler then NCD for final disambiguation) so the user doesn't have to guess at which to try, would probably improve the user experience.

Tom

Hmm, I was aware of the row-based storage, but not aware of the multiple Visitor patterns for data access from the Clusterer!

Do you want to take on that task of making the Clusterer column-based in nature? That would be awesome.

A multi-model DB would be ideal for a number of reasons (aggregations!!!!!, cross/join projects and columns). Luca G. has always wanted to help us get ArcadeDB (OrientDB) embedded as the core of OpenRefine. I pushed Luca to get SIMD acceleration in place and he did it. ArcadeDB starts/stops with the application when it's embedded. Perhaps with Gemini, your brain, and Luca's... it could happen over the course of a weekend?