User-defined Clustering Project

I believe selecting pairs with the smallest distance, particularly those with a distance of less than 5, will offer meaningful examples by highlighting subtle differences that the clustering function can effectively capture. What do you think of that?

P.S. It seems like we're on the same page. We could calculate all the distances and then showcase the first 10 clusters with a distance of less than 5.
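To make the idea concrete, here is a minimal sketch of "calculate all the distances, then show the first 10 with a distance of less than 5". It is written in Python purely for illustration and is not OpenRefine code: `levenshtein` stands in for whatever distance the user defines, and `closest_pairs`, `threshold`, and `limit` are hypothetical names.

```python
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Edit distance via the standard two-row dynamic programme."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def closest_pairs(values, distance=levenshtein, threshold=5, limit=10):
    """Return up to `limit` (distance, value1, value2) tuples, closest first.

    Assumption: identical values are collapsed first, so every pair shown
    differs in at least one character.
    """
    distinct = sorted(set(values))
    scored = [(distance(a, b), a, b) for a, b in combinations(distinct, 2)]
    close = [item for item in scored if item[0] < threshold]
    close.sort()
    return close[:limit]


if __name__ == "__main__":
    column = ["kitten", "sitting", "kitchen", "mitten", "kitten"]
    for dist, value1, value2 in closest_pairs(column):
        print(dist, value1, value2)
```

The all-pairs step grows quadratically with the number of distinct values, so this only suits small columns or a sample of a larger one.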

I think this would be a better option if we decided to display multiple pairs. However, if we're only showing one pair, it would be more effective to present them in a text box with proper text wrapping and the ability to resize the box, as Owen suggested.

The main question there is: how do you select the pairs of value1 and value2 to run the preview on? Remember that all the values are coming from the same column. The "interesting" pairs of values to use as examples are those for which the values are reasonably close… but how should the system select those?

I think pairs whose distances should be small, but aren't because of algorithm bugs, are also interesting. It might be worth talking to people who develop distance algorithms to understand what types of test data sets they typically use. One possibility might be to use two columns during algorithm development, with one column for value1 and a separate one for value2, so that the developer has explicit control over the pairing of test values. If you stick with a single column, perhaps recommend that the test data set be small enough that an all-vs-all computation happens quickly and produces results that fit completely in the preview window.
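A rough sketch of the two-column idea, in Python for illustration only. Everything here is hypothetical: `toy_distance` is a deliberately naive stand-in for a distance under development, and the third field in each row (the largest distance the developer expects) is an extra assumption added to flag pairs that should be close but aren't.

```python
def toy_distance(a: str, b: str) -> int:
    """Naive positional distance: mismatched positions plus length difference.

    Deliberately simplistic, so that an inserted character shifts the
    comparison and inflates the result.
    """
    a, b = a.lower(), b.lower()
    mismatches = sum(1 for x, y in zip(a, b) if x != y)
    return mismatches + abs(len(a) - len(b))


def find_surprises(rows, distance):
    """rows: (value1, value2, expected_max) tuples chosen by the developer.

    Returns (value1, value2, expected_max, actual) for every row whose
    actual distance exceeds what the developer expected.
    """
    surprises = []
    for value1, value2, expected_max in rows:
        actual = distance(value1, value2)
        if actual > expected_max:
            surprises.append((value1, value2, expected_max, actual))
    return surprises


if __name__ == "__main__":
    test_rows = [
        ("Apple", "apple", 0),    # case difference only
        ("color", "colour", 1),   # one inserted letter
    ]
    for row in find_surprises(test_rows, toy_distance):
        print(row)
```

With explicit pairs the preview never has to guess which values are interesting; the developer's test rows decide that.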

Tom

Regarding the problem of selecting the values for previewing custom distances, we considered various approaches:

  • selecting them based on Levenshtein clustering (independently of the custom distance provided), so that the pairs of values used for previewing the expression would not change as the user changes the expression
  • selecting them based on the custom distance provided, showing distances between values from the generated clusters (possibly generated with a slightly higher distance threshold)
  • selecting them based on just “pre-clustering” using the blocking chars
  • letting the user type the example values themselves, with two input fields which are pre-filled with the same initial value (the first value in the column)
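For readers unfamiliar with the third option, here is a minimal sketch of pre-clustering by blocking. It assumes "blocking chars" means that two values become a candidate pair when they share a substring of `n` characters; that reading, and the names `block_by_ngrams` and `candidate_pairs`, are assumptions for illustration (in Python), not OpenRefine's actual implementation.

```python
from collections import defaultdict
from itertools import combinations


def block_by_ngrams(values, n=6):
    """Map each n-character substring to the set of values containing it."""
    blocks = defaultdict(set)
    for value in set(values):
        key = value.lower()
        if len(key) < n:
            grams = {key}  # short values form a block of their own
        else:
            grams = {key[i:i + n] for i in range(len(key) - n + 1)}
        for gram in grams:
            blocks[gram].add(value)
    return blocks


def candidate_pairs(values, n=6):
    """Pairs of values sharing at least one block; no distance is computed."""
    pairs = set()
    for members in block_by_ngrams(values, n).values():
        for a, b in combinations(sorted(members), 2):
            pairs.add((a, b))
    return sorted(pairs)


if __name__ == "__main__":
    column = ["Jonathan Smith", "Jonathan Smyth", "Maria Lopez"]
    print(candidate_pairs(column))
```

Because no distance is evaluated, the candidate pairs stay the same while the user edits the expression, much like the first option.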

@antonin_d and I agreed on the last option, where the result updates dynamically as the input values change.
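The agreed behaviour can be sketched roughly as follows. This is a Python stand-in rather than the actual feature: in OpenRefine the distance comes from a user-written expression, whereas here `distance` is any callable, and `initial_inputs` and `preview` are hypothetical names.

```python
def initial_inputs(column_values):
    """Both fields start with the same value: the first one in the column."""
    first = column_values[0] if column_values else ""
    return first, first


def preview(value1, value2, distance):
    """Recompute the displayed result; call this whenever an input changes.

    Errors are returned as text so that a half-typed expression or an odd
    input shows a message instead of breaking the dialog.
    """
    try:
        return str(distance(value1, value2))
    except Exception as exc:
        return f"Error: {exc}"


if __name__ == "__main__":
    def mismatches(a, b):
        # toy distance, equal-length inputs only
        return sum(x != y for x, y in zip(a, b))

    value1, value2 = initial_inputs(["abc", "abd", "xyz"])
    print(preview(value1, value2, mismatches))   # both fields hold "abc"
    print(preview(value1, "abd", mismatches))    # user edited the second field
```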

Here it is. Let me know what you think!

Hi OpenRefine Community,

As my internship has come to an end and the feature is now in the hands of users, I just want to say a big thank you for the opportunity to contribute this year. I’ve really enjoyed working with all of you.

Special thanks to @antonin_d, @ostephens, @b2m, @thadguidry, @tfmorris, and @Martin – your support made this possible!


Thank you for your work and patience, and good luck for the future, @zyadtaha!


You'll go far, @zyadtaha. Keep looking at the big picture when the little details become too much. My recommendation is to take those big chunks and break them into small chunks, and then into much smaller bite-sized chunks that you can easily eat (program) with joy.


zyadtaha wrote on September 17:

"As my internship has come to an end and the feature is now in the hands of users, I just want to say a big thank you for the opportunity to contribute this year. I’ve really enjoyed working with all of you."

Congratulations on your successful internship! We appreciate your contributions (and Antonin's mentorship which helped make them possible). We look forward to future contributions from you. Perhaps next GSoC you'll want to be a mentor!

Tom
