One of the things I’ve been working on is revisiting some old problem areas in clustering performance and using AI to see whether it could help. I knew there were areas in Simile-Vicino that caused excessive thread delay with certain algorithms, and I wanted to find out why and perhaps squeeze out some more optimization. My hunch seems to have been fruitful: I found a few areas in SecondString (a Vicino dependency) that could be improved, and I also applied a few Vector API optimizations where they could help. There might be other areas and dependencies in Simile-Vicino, and even in OpenRefine’s own code, that I have not looked at yet, but I’ll save that for another day when I have some more free time. Clustering performance scales with the size of the strings, and thus with the number of tokens that need to be parsed, binned, indexed, etc.
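To make the point about string length concrete, here is a minimal textbook Jaro-Winkler sketch in plain Java. It is not SecondString's actual code and not the optimized version; the class name `JaroWinklerSketch`, the 0.1 prefix scale, and the 4-character prefix cap are my own illustrative assumptions taken from the usual textbook definition. The nested matching loop is the part whose cost grows with string length, which is presumably why long values hurt so much.

```java
// Textbook Jaro-Winkler, for illustration only (NOT SecondString's implementation).
final class JaroWinklerSketch {

    // Jaro similarity in [0, 1]; cost is roughly O(length * window).
    static double jaro(String s, String t) {
        int n = s.length(), m = t.length();
        if (n == 0 && m == 0) return 1.0;
        if (n == 0 || m == 0) return 0.0;

        // Characters only count as matching within this distance of each other.
        int window = Math.max(0, Math.max(n, m) / 2 - 1);
        boolean[] sMatched = new boolean[n];
        boolean[] tMatched = new boolean[m];
        int matches = 0;

        // Hot loop: for each char of s, scan a window of t for an unused equal char.
        for (int i = 0; i < n; i++) {
            int lo = Math.max(0, i - window);
            int hi = Math.min(m - 1, i + window);
            for (int j = lo; j <= hi; j++) {
                if (!tMatched[j] && s.charAt(i) == t.charAt(j)) {
                    sMatched[i] = true;
                    tMatched[j] = true;
                    matches++;
                    break;
                }
            }
        }
        if (matches == 0) return 0.0;

        // Count matched characters that appear in a different order.
        int outOfOrder = 0;
        int k = 0;
        for (int i = 0; i < n; i++) {
            if (!sMatched[i]) continue;
            while (!tMatched[k]) k++;
            if (s.charAt(i) != t.charAt(k)) outOfOrder++;
            k++;
        }

        double mm = matches;
        return (mm / n + mm / m + (mm - outOfOrder / 2.0) / mm) / 3.0;
    }

    // Winkler adjustment: boost the score for a shared prefix of up to 4 chars.
    static double jaroWinkler(String s, String t) {
        double j = jaro(s, t);
        int limit = Math.min(4, Math.min(s.length(), t.length()));
        int prefix = 0;
        while (prefix < limit && s.charAt(prefix) == t.charAt(prefix)) prefix++;
        return j + prefix * 0.1 * (1.0 - j);
    }

    public static void main(String[] args) {
        System.out.printf("%.4f%n", jaro("MARTHA", "MARHTA"));        // approx. 0.9444
        System.out.printf("%.4f%n", jaroWinkler("MARTHA", "MARHTA")); // approx. 0.9611
    }
}
```

With strings averaging 120 to 140 characters, as in the datasets below, that inner window scan covers dozens of characters per outer iteration for every pair compared, so it seems like a natural target for tighter loops or vectorized comparisons.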
I’m not quite sure how to contribute this. I’m thinking I can probably just create a dedicated repo for SecondString-fast for build and review. @tfmorris, any suggestions here?
Here are the JMH performance benchmarks against a few known large test datasets, using a new optimized build of SecondString-fast.jar.
WARNING: Using incubator modules: jdk.incubator.vector
=== Dataset: acm_large ===
rows=2294 avg_len=143.9 max_len=392
JaroWinkler n=140 orig=1361ms fast=156ms speedup=8.72x
SoftTFIDF n=90 orig=891ms fast=509ms speedup=1.75x
Dictionary n=2294 q=100 orig=8111ms fast=3343ms speedup=2.43x
=== Dataset: dblp_large ===
rows=2616 avg_len=121.8 max_len=398
JaroWinkler n=140 orig=932ms fast=113ms speedup=8.25x
SoftTFIDF n=90 orig=663ms fast=246ms speedup=2.70x
Dictionary n=2616 q=100 orig=6822ms fast=3548ms speedup=1.92x
=== Dataset: itunes_amazon_tableB_large ===
rows=55923 avg_len=142.1 max_len=550
JaroWinkler n=140 orig=2075ms fast=155ms speedup=13.39x
SoftTFIDF n=90 orig=825ms fast=447ms speedup=1.85x
Dictionary n=12000 q=100 orig=52307ms fast=14837ms speedup=3.53x