User-defined Clustering Project

Hi OpenRefine Community,

I’m Zyad Taha, the Google Summer of Code intern and a first-year computer engineering student from Egypt. This summer I will be working on a project that lets users define their own clustering functions, under the mentorship of @antonin_d.

Here is a post to discuss the design of the feature with you.

Project Details

OpenRefine offers predefined functions for computing clusters. These functions may not suit all user needs, but with some knowledge of the data, users could come up with a better algorithm and express it as a GREL/Jython/Clojure function, as mentioned in this issue.

This project will enable custom expressions for both binning (key collision) and kNN-based (nearest neighbor) clustering, so that we can offer users more personalized and precise clustering. Giving users the ability to implement their own solutions can also lead to a more engaged user base, who may contribute suggestions for new features or improvements based on their experience.
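To make the binning idea concrete, here is a minimal Python sketch of key-collision clustering driven by a user-supplied keying function. It is only an illustration, not OpenRefine's implementation: `fingerprint_key` and `key_collision_clusters` are hypothetical names, and the keyer is just one example of what a user might write.

```python
from collections import defaultdict


def fingerprint_key(value):
    """Hypothetical user-defined keyer: lowercase, drop punctuation,
    then sort the unique tokens so word order does not matter."""
    cleaned = "".join(c if c.isalnum() or c.isspace() else " " for c in value.lower())
    return " ".join(sorted(set(cleaned.split())))


def key_collision_clusters(values, keyer):
    """Group values whose keys collide; keep only bins that hold
    more than one distinct value, since those are the merge candidates."""
    bins = defaultdict(list)
    for v in values:
        bins[keyer(v)].append(v)
    return [vs for vs in bins.values() if len(set(vs)) > 1]


values = ["New York", "york new", "New  York.", "Boston"]
clusters = key_collision_clusters(values, fingerprint_key)
print(clusters)
```

Swapping in a different keyer changes which values end up in the same bin, which is the kind of flexibility the custom expression is meant to provide.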

Project Design

I suggest adding a button to the clustering dialog. When the user clicks it, a new expression dialog appears where they can enter a custom expression.

The proposed design for the button in the Key Collision method is:


The proposed design for the button in the Nearest Neighbor method is:


When the button is clicked, one of these new expression dialogs will appear, letting the user enter a custom expression and give the custom function a name.

The proposed design for the expression dialog to add a keying function is:


The proposed design for the expression dialog to add a distance function is:


Overall, the user flow will look as follows:

  1. Select the ‘Key Collision’ or ‘Nearest Neighbor’ method; if none is selected, ‘Key Collision’ is used by default.
  2. Click the ‘Add your own function’ button.
  3. The new expression dialog appears.
  4. Choose the expression language (GREL, Jython, or Clojure).
  5. Write the custom function.
  6. Press the OK button.
  7. The expression dialog disappears.
  8. Select the 'Merge' checkbox for the desired clusters.
  9. Execute the clustering of the data according to the custom function.

I am excited to get to know each one of you and hear your opinions/feedback on the design. I hope that this project will help OpenRefine grow more and more.

I like the idea of being able to provide customized keying and distance functions. It somehow goes along with the concept of Facets in OpenRefine, where you can easily provide custom expressions for special purposes.

Question 1: Are the customized functions somehow persisted, or only available on a one-time basis?
Question 2: For calculating the distance you need (at least) two values. I guess this is obvious but somehow confusing in the proposed design for the expression dialog for the distance function. So will there be a value1 and value2?
Question 3: If the customized functions somehow get persisted, would it be possible to somehow export and import them?

Question 1: Are the customized functions somehow persisted, or only available on a one-time basis?

  • Yes, we should let them be persisted, with the ability to delete them later.

  • This can be done by adding each new function to the list of function options in the clustering dialog, with a small x icon for deleting it.

  • Or maybe we should add something like the starred tab in the expression dialog. What do you think?

Question 2: For calculating the distance you need (at least) two values. I guess this is obvious but somehow confusing in the proposed design for the expression dialog for the distance function. So will there be a value1 and value2?

  • The default expression of “value” won't work for distance-based clustering, so yes, the idea of having “value1” and “value2” makes sense!
  • We’d need a default expression like “editDistance(value1, value2)” in GREL, so I have opened an issue to suggest providing it.
  • Here's what it will look like:
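Separately from the dialog mockup, the value1/value2 idea can be sketched in Python as follows. This is only an assumption-laden illustration, not OpenRefine's implementation: `edit_distance` stands in for the proposed two-argument expression, `nearest_neighbor_clusters` and `radius` are hypothetical names, and all pairs are compared by brute force for simplicity.

```python
def edit_distance(value1, value2):
    """Levenshtein distance between two strings (dynamic programming,
    keeping only the previous row of the table)."""
    prev = list(range(len(value2) + 1))
    for i, c1 in enumerate(value1, start=1):
        curr = [i]
        for j, c2 in enumerate(value2, start=1):
            cost = 0 if c1 == c2 else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def nearest_neighbor_clusters(values, distance, radius):
    """For each distinct value, collect the other values within `radius`
    according to the user-supplied two-argument distance function."""
    distinct = list(dict.fromkeys(values))  # de-duplicate, keep order
    clusters = []
    for v in distinct:
        neighbors = [w for w in distinct if w != v and distance(v, w) <= radius]
        if neighbors:
            cluster = sorted([v] + neighbors)
            if cluster not in clusters:  # avoid listing the same group twice
                clusters.append(cluster)
    return clusters


names = ["Zyad", "Ziad", "Antonin"]
print(nearest_neighbor_clusters(names, edit_distance, radius=1))
```

The point of the sketch is that the distance function always receives a pair of values, which is why a single `value` variable is not enough in the expression dialog.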

Question 3: If the customized functions somehow get persisted, would it be possible to somehow export and import them?

  • I am not sure we need much more than letting the user see the expressions of the custom keyers and distances.
  • Or perhaps a way to export all custom keyers and distances at once?

Thanks for your great feedback!

Personally, I think that adding them to the drop-down menu as shown is perfect.
In case the number of custom expressions grows, there are also searchable drop-down menus...

Maybe you can even move the "Add new..." functionality into the drop-down menu as its last entry?
This way we save a lot of valuable space in the already too crowded clustering dialog window.

Some people use a lot of different OpenRefine instances, for example one Docker-based OpenRefine instance per project.
Manually synchronizing custom keying and distance functions in such settings would be quite a pain.

So having a way to export and import these, for example analogous to exporting and importing (steps from) the history of projects, would be quite helpful. This would also help with sharing algorithms between users.
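As a rough idea of what such an export/import could look like, here is a small Python sketch that round-trips custom functions through JSON. The field names (`name`, `type`, `language`, `expression`) and the `customClusteringFunctions` key are purely hypothetical and not an existing OpenRefine format; the `editDistance(value1, value2)` expression is the one proposed earlier in this thread.

```python
import json

# Hypothetical in-memory representation of saved custom functions.
custom_functions = [
    {"name": "my-fingerprint", "type": "keyer",
     "language": "grel", "expression": "value.fingerprint()"},
    {"name": "my-edit-distance", "type": "distance",
     "language": "grel", "expression": "editDistance(value1, value2)"},
]


def export_functions(functions):
    """Serialize the custom functions to a JSON string that can be
    saved to a file or pasted into another instance."""
    return json.dumps({"customClusteringFunctions": functions}, indent=2)


def import_functions(text):
    """Parse a previously exported JSON string back into a list."""
    return json.loads(text)["customClusteringFunctions"]


exported = export_functions(custom_functions)
restored = import_functions(exported)
print(restored == custom_functions)
```

A plain JSON document like this would be easy to copy between instances, in the same spirit as copying the extracted operation history.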

Thanks for the details @zyadtaha

I think we need both.

  • The drop-down for quick access
  • The GREL expression history tab (with the option to star) to actually see what the expression was. My guess is that this would be supported natively if the GREL expression dialog is reused for adding a new distance function.

I believe that clustering preferences should be addressed when we enhance the user preference capabilities, in phase 2 or 3. Overall, I will be careful not to expand the scope of the GSoC project too much.