Hi OpenRefine Community,
I wanted to bring up a discussion regarding the Levenshtein distance implementations used in OpenRefine. We’ve been exploring three different implementations in this PR: Simile Vicino, Apache Commons Text, and the PassJoin algorithm. I’d like to gather input from the community on which would be the most suitable for our needs.
1. Simile Vicino:
- This has been our go-to implementation, but it’s relatively slow and has limitations in performance, especially with large datasets.
2. Apache Commons Text:
- Offers a more optimized implementation that’s faster and more memory-efficient, but it seems to produce different clustering results compared to Simile Vicino, possibly due to different handling of whitespace and character boundaries.
3. PassJoin Algorithm:
- Known for its speed and efficiency, particularly with large datasets. However, it introduces complexity in handling certain string formats and boundaries. Its implementation might also differ from the traditional Levenshtein distance that some users expect.
Given these considerations, I’d like to open the floor for discussion:
- Which algorithm do you think should be the default in OpenRefine, considering our typical use cases?
- Should we provide users with the option to choose between these algorithms in the UI, depending on their specific needs?
- Are there any other factors or potential pitfalls we should consider in this decision?
Feel free to share your experiences, concerns, or any additional thoughts you might have.