Clustering test data - What's your favorite thing to cluster?

The development team is on the hunt for representative test data to use in improving the performance of the OpenRefine clustering algorithms. @thadguidry put together a nice little benchmark program, which he described in another post, but we need your data to get the most out of it.

What kind of data? Whatever you spend most of your time clustering! The best data would be:

  • open or something which can be made open so we can include it in our test suite
  • diverse domains - personal names, geographic names, book titles, etc.
  • cultural / language / character set diversity
  • with ground truth data - this is a stretch, but if you have data that you’ve already clustered with the original string and the normalized string in adjacent columns, that would be ideal, particularly if the normalized version has been validated in some way

If your examples check some, but not all, of these boxes, we’d still love to hear from you. Also let us know which clustering algorithms and settings you use the most and we’ll focus on optimizing those.
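
For anyone wondering how the ground truth columns would get used, here is a rough Python sketch of the idea. It assumes a CSV with two hypothetical columns, `original` and `normalized`, and scores a simplified fingerprint-style keyer against them pair by pair. The `fingerprint` function is only an approximation of key-collision clustering, not OpenRefine's actual implementation, and the sample rows are made up for illustration.

```python
import csv
import io
import string
import unicodedata
from itertools import combinations

# Made-up sample; the column names "original" and "normalized" are assumptions.
SAMPLE = '''original,normalized
"Smith, John",John Smith
John Smith,John Smith
john  smith.,John Smith
Jon Smith,John Smith
José García,José García
Jose Garcia,José García
'''


def fingerprint(value):
    """Simplified fingerprint key (an approximation, not OpenRefine's code)."""
    value = unicodedata.normalize("NFKD", value.strip().lower())
    # Drop combining marks so accented letters fold to their base letter.
    value = "".join(ch for ch in value if not unicodedata.combining(ch))
    # Remove ASCII punctuation, then sort and de-duplicate the tokens.
    value = value.translate(str.maketrans("", "", string.punctuation))
    return " ".join(sorted(set(value.split())))


def load_rows(text):
    """Read (original, normalized) pairs from CSV text."""
    reader = csv.DictReader(io.StringIO(text))
    return [(row["original"], row["normalized"]) for row in reader]


def score(rows):
    """Pairwise precision and recall of the keyer against the ground truth."""
    tp = fp = fn = 0
    for (a, a_norm), (b, b_norm) in combinations(rows, 2):
        predicted = fingerprint(a) == fingerprint(b)
        expected = a_norm == b_norm
        if predicted and expected:
            tp += 1
        elif predicted:
            fp += 1
        elif expected:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall


precision, recall = score(load_rows(SAMPLE))
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this sample the keyer never merges two values that the ground truth keeps apart, but it misses "Jon Smith", which is the kind of gap validated data would let us measure across algorithms and settings.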

Apache 2 licensed (but uses public domain data): string-similarity/testOutputData/mergedNames2013w1983Vs2003Vs1973Output.txt in neustar/string-similarity (master branch)

I'll try to find some more from my archived older OpenRefine projects.