Clustering test data - What's your favorite thing to cluster?

tfmorris · March 24, 2026, 3:25am

The development team is on the hunt for representative test data to use in improving the performance of the OpenRefine clustering algorithms. @thadguidry put together a nice little benchmark program which he described in another post, but we need your data to drive it to best effect.

What kind of data? Whatever you spend most of your time clustering! The best data would be:

open, or something that can be made open, so we can include it in our test suite
from diverse domains - personal names, geographic names, book titles, etc.
cultural / language / character set diversity
with ground truth data - this is a stretch, but if you have data that you’ve already clustered with the original string and the normalized string in adjacent columns, that would be ideal, particularly if the normalized version has been validated in some way

If your examples check some, but not all, of these boxes, we’d still love to hear from you. Also let us know which clustering algorithms and settings you use the most, and we’ll focus on optimizing those.
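For the ground-truth case, a two-column file is all we need. As a rough sketch (the column names `original` and `normalized`, the sample rows, and the helper `expected_clusters` are all made up for illustration, not an agreed format), something like this would turn such a file into the expected clusters that a benchmark could compare against:

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample data: the column names and rows are assumptions
# for illustration, not a format the project has settled on.
SAMPLE = """original,normalized
Ada Lovelace,Ada Lovelace
"lovelace, ada",Ada Lovelace
Alan Turing,Alan Turing
TURING Alan,Alan Turing
"""


def expected_clusters(csv_text):
    """Group original strings by their validated normalized form."""
    clusters = defaultdict(set)
    for row in csv.DictReader(io.StringIO(csv_text)):
        clusters[row["normalized"]].add(row["original"])
    return dict(clusters)


clusters = expected_clusters(SAMPLE)
for normalized, originals in sorted(clusters.items()):
    print(normalized, "<-", sorted(originals))
```

Each key is a validated normalized value and each set holds the raw strings that should end up in the same cluster, so an export of two adjacent columns from an existing project would already be usable.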

thadguidry · March 27, 2026, 8:09am

Apache 2 licensed (but uses public domain data): string-similarity/testOutputData/mergedNames2013w1983Vs2003Vs1973Output.txt at master · neustar/string-similarity

I'll try to find some more from my archived older OpenRefine projects.
