Best Levenshtein Distance Algorithm for OpenRefine – Simile Vicino, Apache, or PassJoin?

zyadtaha · August 13, 2024, 4:08pm

Hi OpenRefine Community,

I wanted to bring up a discussion regarding the Levenshtein distance implementations used in OpenRefine. We’ve been exploring three different implementations in this PR: Simile Vicino, Apache Commons Text, and the PassJoin algorithm. I’d like to gather input from the community on which would be the most suitable for our needs.

1. Simile Vicino:

This has been our go-to implementation, but it’s relatively slow and has limitations in performance, especially with large datasets.

2. Apache Commons Text:

Offers a more optimized implementation that’s faster and more memory-efficient, but it seems to produce different clustering results compared to Simile Vicino, possibly due to different handling of whitespace and character boundaries.

3. PassJoin Algorithm:

Known for its speed and efficiency, particularly with large datasets. However, it introduces complexity in handling certain string formats and boundaries. Its implementation might also differ from the traditional Levenshtein distance that some users expect.

Given these considerations, I’d like to open the floor for discussion:

Which algorithm do you think should be the default in OpenRefine, considering our typical use cases?
Should we provide users with the option to choose between these algorithms in the UI, depending on their specific needs?
Are there any other factors or potential pitfalls we should consider in this decision?

Feel free to share your experiences, concerns, or any additional thoughts you might have.

Martin · August 13, 2024, 4:55pm

Based on my experience, there is no single "right" clustering algorithm as it depends on the dataset and the user's objectives. I usually suggest experimenting with different algorithms, often starting with the most conservative ones such as fingerprint or levenstein.

So I would favor supporting the three implementations you are suggesting and letting the user choose what best fits their use case.

Topic		Replies	Views
Data Cleaning and Transformation Automation and Performance Data cleaning and transformations	4	446	May 3, 2023
User-defined Clustering Project Development & Design	47	306	September 17, 2024
Cluster and edit function returns similar values Support and Helpdesk	3	70	June 12, 2024
Clustering method selection: immediate start Support and Helpdesk	1	32	March 27, 2025
Cluster and edit facet automation Data cleaning and transformations	2	46	July 18, 2024

Best Levenshtein Distance Algorithm for OpenRefine – Simile Vicino, Apache, or PassJoin?

Related topics