User-defined Clustering Project

Hi OpenRefine Community,

I’m Zyad Taha, a Google Summer of Code intern and a first-year computer engineering student from Egypt. This summer I will be working on a project that lets users define their own clustering functions, under the mentorship of @antonin_d.

Here is a post to discuss the design of the feature with you.

Project Details

OpenRefine offers predefined functions for computing clusters. These functions may not suit all user needs, but with some knowledge of the data, users could come up with a better algorithm and express it as a GREL/Jython/Clojure function, as mentioned in this issue.

This project will enable custom expressions for binning and kNN-based clustering, so that we can offer users more personalized and precise clustering. Giving them the ability to implement their own solutions can also lead to a more engaged user base, which may in turn contribute suggestions for new features or improvements based on their experience.
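To make the idea concrete, here is a minimal sketch of key-collision (binning) clustering driven by a user-supplied keying function. It is written in plain Python rather than as an OpenRefine expression, and `cluster_by_key` and `custom_keyer` are hypothetical names chosen for illustration; this is not OpenRefine's actual implementation.

```python
from collections import defaultdict

def cluster_by_key(values, keyer):
    """Group values whose keys collide; keep only groups with 2+ distinct values."""
    buckets = defaultdict(set)
    for value in values:
        buckets[keyer(value)].add(value)
    return [sorted(group) for group in buckets.values() if len(group) > 1]

def custom_keyer(value):
    # Hypothetical user-defined keyer: lowercase the value, replace every
    # non-alphanumeric character with a space, then sort the tokens.
    cleaned = "".join(c if c.isalnum() else " " for c in value.lower())
    return " ".join(sorted(cleaned.split()))

cities = ["New York", "new york", "York, New", "Boston"]
clusters = cluster_by_key(cities, custom_keyer)
print(clusters)
```

Values that produce the same key end up in the same cluster, so the quality of the clustering depends entirely on how well the keyer captures what "the same" means for the data at hand.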

Project Design

I suggest adding a button to the clustering dialog. When the user clicks it, a new expression dialog appears where they can enter a custom expression.

The proposed design for the button in the Key Collision method is:


The proposed design for the button in the Nearest Neighbor method is:


When the button is clicked, one of these new expression dialogs appears, letting the user enter a custom expression and give the custom function a name.

The proposed design for the expression dialog to add a keying function is:


The proposed design for the expression dialog to add a distance function is:


Overall, the user flow will look as follows:

  1. Select the ‘Key Collision’ or ‘Nearest Neighbor’ method (‘Key Collision’ is selected by default).
  2. Click the ‘Add your own function’ button.
  3. The new expression dialog appears.
  4. Choose the expression language (GREL, Jython, or Clojure).
  5. Write the custom function.
  6. Press the OK button.
  7. The expression dialog closes.
  8. Select the 'merge' checkbox for the wanted clusters.
  9. Execute the clustering of the data according to the custom function.

I am excited to get to know each one of you and hear your opinions/feedback on the design. I hope that this project will help OpenRefine grow more and more.

I like the idea of being able to provide customized keying and distance functions. It somehow goes along with the concept of Facets in OpenRefine, where you can easily provide custom expressions for special purposes.

Question 1: Are the customized functions somehow persisted, or only available on a one-time basis?
Question 2: For calculating the distance you need (at least) two values. I guess this is obvious but somehow confusing in the proposed design for the expression dialog for the distance function. So will there be a value1 and value2?
Question 3: If the customized functions somehow get persisted, would it be possible to somehow export and import them?

Question 1: Are the customized functions somehow persisted, or only available on a one-time basis?

  • Yes, we should persist them, with the ability to delete them later.

  • This can be done by adding each new function to the functions' options list in the clustering dialog with a small x icon to be able to delete it.

  • Or maybe we should add something like the starred tab in the expression dialog. What do you think?

Question 2: For calculating the distance you need (at least) two values. I guess this is obvious but somehow confusing in the proposed design for the expression dialog for the distance function. So will there be a value1 and value2?

  • The default expression of “value” won't work for distance-based clustering, so yes, the idea of having “value1” and “value2” makes sense!
  • We’d need a default expression like “editDistance(value1, value2)” in GREL, so I have opened an issue to suggest providing it.
  • Here's what it will look like:

Question 3: If the customized functions somehow get persisted, would it be possible to somehow export and import them?

  • I am not sure we need much more than letting the user see the expressions of the custom keyers and distances?
  • Or perhaps a way to export all custom keyers and distances at once?

Thanks for your great feedback!

Personally I think that adding them as shown in the drop down menu is perfect.
In case the number of custom expressions grows there are also searchable drop down menus...

Maybe you can even move the "Add new..." functionality as last entry into the drop down menu?
This way we save a lot of valuable space in the already too crowded clustering dialog window.

There are people using a lot of different OpenRefine instances, for example one Docker-based OpenRefine instance per project.
Manually synchronizing custom keying and distance functions in such settings would be quite a pain.

So having a way to export and import these, for example analogous to exporting and importing (steps from) the history of a project, would be quite helpful. This would also help with sharing algorithms between users.
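As a rough illustration of what such an export and import could look like, here is a small Python sketch that round-trips custom functions through JSON. The JSON layout, the field names (`name`, `type`, `language`, `expression`) and the helper names are purely hypothetical and not an agreed format; `editDistance(value1, value2)` is the example expression discussed in this thread, not a confirmed GREL function.

```python
import json

# Hypothetical saved functions; the field names are assumptions for illustration only.
saved = [
    {"name": "Lowercase fingerprint", "type": "keying",
     "language": "grel", "expression": "value.toLowercase().fingerprint()"},
    {"name": "Edit distance", "type": "distance",
     "language": "grel", "expression": "editDistance(value1, value2)"},
]

def export_functions(functions):
    """Serialize custom functions to a JSON string that can be shared."""
    return json.dumps(functions, indent=2)

def import_functions(text, existing):
    """Add imported functions to `existing`, skipping (type, name) pairs already present."""
    known = {(f["type"], f["name"]) for f in existing}
    merged = list(existing)
    for f in json.loads(text):
        key = (f["type"], f["name"])
        if key not in known:
            merged.append(f)
            known.add(key)
    return merged

exported = export_functions(saved)
other_instance = [saved[0]]  # an instance that already has the keying function
merged = import_functions(exported, other_instance)
print([f["name"] for f in merged])
```

Skipping duplicates by type and name is just one possible policy; overwriting or prompting the user would be equally valid choices.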

Thanks for the details @zyadtaha

I think we need both.

  • The drop-down for quick access
  • The operation GREL history tab (with the option to star) to actually see what the expression was. My guess is that this would be natively supported if the GREL modal is reused for adding a new distance function.

I believe that incorporating the clustering preference should be considered when we enhance user preference capabilities and should be considered for phase 2 or 3. Overall, I will be careful not to expand the scope of the GSoC project too much.

Yes totally agree. Just trying to think ahead and sometimes already having future requirements might help with design or implementation choices.

I support @Martin's idea of reusing the existing expression dialog (which comes with Preview, History and Favourite tabs for the expressions). It would also let the user select which expression language to use (it does not have to be GREL, I would say).

In the case of a custom distance, we would need to find a way to explain to the user which variables they can use (value1 and value2, say) - perhaps it's good enough to have a default expression in the editor that contains those two values (say, editDistance(value1, value2)).
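To illustrate how a two-argument distance could drive clustering, here is a simplified Python sketch. The names `edit_distance` and `cluster_by_distance` are hypothetical, and the grouping shown (link every pair within a radius, then take connected groups) is a naive stand-in, not OpenRefine's actual nearest-neighbour implementation.

```python
from itertools import combinations

def edit_distance(value1, value2):
    """Levenshtein distance between two strings, computed row by row."""
    prev = list(range(len(value2) + 1))
    for i, c1 in enumerate(value1, start=1):
        curr = [i]
        for j, c2 in enumerate(value2, start=1):
            cost = 0 if c1 == c2 else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def cluster_by_distance(values, distance, radius):
    """Link values with distance <= radius and return connected groups of 2+ values."""
    distinct = sorted(set(values))
    parent = {v: v for v in distinct}

    def find(v):
        # Union-find lookup with path halving.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for a, b in combinations(distinct, 2):
        if distance(a, b) <= radius:
            parent[find(a)] = find(b)

    groups = {}
    for v in distinct:
        groups.setdefault(find(v), []).append(v)
    return [g for g in groups.values() if len(g) > 1]

words = ["color", "colour", "colr", "flavor"]
print(cluster_by_distance(words, edit_distance, radius=1))
```

The user-supplied part is only the `distance(value1, value2)` callable; any function returning a number for a pair of values could be swapped in.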

To integrate the "delete" button in the drop-down menu (and perhaps also the "Add new keying function" button, which I think would make sense) I think we'll probably need to move away from the current native drop-down selection menu. It's probably doable but maybe quite some work if it needs to be developed from scratch (I wouldn't really know any similar widget we could reuse).

I'm only just catching up with this proposal and apologies to @zyadtaha who has already made progress with this.

Something that I think is missing from the proposal above is the ability to Edit an existing function. I think this could be necessary as you may realise as you work with a function that it could be improved by making some changes.

With the designs agreed above the only option for adding an edit would be to have another option in the dropdown (alongside delete). This feels like too much to me (I'm not that keen on the delete being in such proximity to the select either to be honest). I wonder if we might consider introducing a "Manage clustering functions" somewhere in the UI? This could be somewhere separate to the actual clustering dialogue (e.g. in Preferences?) and give space for us to expand functionality in future if necessary (e.g. I can imagine it being useful to be able to export/import a set of clustering functions to share with others or between OR installations)

I think changing the "Add new keying/distance function" buttons to be something like "Manage xxx functions" (or any similarly appropriate graphical icon) could be used to bring up a dialog with a full range of add/edit/delete functionality and avoid having to overload the pulldown menu. That would allow for a lot more room and freedom to design.

Tom

Thanks for your valuable feedback, @ostephens @tfmorris.

After considering how to avoid overloading the pulldown menu, I think we can include just two options in the clustering dialog. The first option would show the name of the function used for clustering.

The second option would be to manage the clustering functions (I prefer calling it 'Manage functions' rather than Owen's suggestion, 'Manage clustering functions,' to keep it concise). Clicking on this option would open a dialog showing a list of available functions, including custom ones, with options to search, select, add, delete, or edit them.

What do you think?

Thanks @zyadtaha. My opinion is that we keep "Manage functions" and selecting the function to use separate. My reasons are:

  • This makes "Manage functions" simpler (we don't need "Select") and opens up the potential of using it outside the Clustering context in the future (e.g. if we decide to support functionality outside Clustering)
  • It keeps the keying/clustering function dropdowns simple (no need for any custom component here)
  • It is more intuitive IMO for the dropdown to handle the selection, and the manage component to handle just the management

Essentially I prefer your original design of adding a button to the Clustering dialog for the manage functions option. If there is still the concern raised by @b2m that this makes the Clustering dialog too crowded, then I'd be in favor of moving the option somewhere outside the Clustering dialog rather than making the Select dropdown more complicated. Moving the "Manage functions" option outside the Clustering dialog is really my preference, as it keeps Clustering as a focussed function, and in the simple case of a user just using the existing cluster functions there is no distraction. But I could be happy with your original design with the button label changed to "Manage functions".

Integrating the option to manage the functions into the clustering dialog, especially the 'create new' feels unnecessary to me - as designing a good distance or keying algorithm is going to take some effort and not something a user is likely to do "on the fly" in the middle of a clustering workflow IMO

If we remove the "Select" option and have multiple custom functions, how do we let the user choose one?

Also, shouldn't we show the name of the custom function used for clustering? (Maybe we can add an option in the dropdown to display the name of the function used?)

I prefer the "Manage functions" option to be in the clustering dialog. However, I'm interested in hearing your ideas on where to move it in the UI instead of under "Preferences," as I'm not very keen on that placement.

I think the drop-down select could show:

  • all the built-in clustering functions
  • all the user-defined ones

and the button to manage user-defined ones would be outside of the select.
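A possible way to picture this split, as a Python sketch: one registry feeds the drop-down with built-in plus user-defined names, while add, edit and delete only ever touch the user-defined ones. The class and method names are hypothetical, and the built-in names passed in below are just sample labels.

```python
class FunctionRegistry:
    """Hypothetical registry backing both the drop-down and the manage dialog."""

    def __init__(self, built_in):
        self._built_in = list(built_in)
        self._custom = {}  # name -> expression, in insertion order

    def options(self):
        """Names shown in the drop-down: built-ins first, then user-defined."""
        return self._built_in + list(self._custom)

    def add(self, name, expression):
        if name in self._built_in or name in self._custom:
            raise ValueError(f"a function named {name!r} already exists")
        self._custom[name] = expression

    def edit(self, name, expression):
        if name not in self._custom:
            raise KeyError(name)  # built-ins cannot be edited
        self._custom[name] = expression

    def delete(self, name):
        if name not in self._custom:
            raise KeyError(name)  # built-ins cannot be deleted
        del self._custom[name]

registry = FunctionRegistry(["fingerprint", "ngram-fingerprint"])
registry.add("my keyer", "value.toLowercase()")
print(registry.options())
```

With this separation the select stays a plain list of names, and all management logic lives behind the button.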

Agree with @antonin_d:

I think the drop-down select could show:

  • all the built-in clustering functions
  • all the user-defined ones

So the user would always select from that menu: exactly the same as currently, but with the option to select any user-defined functions that exist.

Having looked I agree - Preferences definitely not the right place.
I think the best place for now is in the Clustering dialog as a button as you proposed in your original post in this thread

There is a new PR with an updated design - have a look and tell @zyadtaha what you think of it!

Thanks @antonin_d @zyadtaha

I like these new designs and very happy if these go ahead as described. However a few questions/comments that we might want to consider which are slightly broader than the PR:

  1. Will there be separate "manage dialogs" for Keying vs Nearest Neighbour functions? Or will they be managed all together, with some obvious labelling as being one or the other?
  2. Following previous discussions, is the plan that once you have created a function it will appear in the dropdown in the clustering dialog alongside all the other choices?
  3. Is the preview window the right feedback mechanism when creating a new clustering function? Are there any alternatives that can be practically investigated? Perhaps this applies more to Nearest Neighbour than to the Keying options, although in both cases it could (for example) be useful to be able to "view clusters" before saving.
  4. Do we want to give any consideration (now or later) to having a more general "save custom function" option, where the user can write and save a transformation expression and then re-use it? If so, should that inform any work in this project?

Well, how about we make two tabs: one for keying and another for Nearest Neighbour? Similar to the tabs in the expression preview dialog (like history, starred, preview, help). What do you think?

Yes, it should also show up in the management dialog.

Thanks for pointing this out. We've simplified it for now and plan to enhance it later. I agree with your suggestion for "view clusters." What do you think? @antonin_d

Could you please explain more? I'm not sure I understand what you mean.

one for keying and another for Nearest Neighbour? Similar to the tabs in the expression preview dialog (like history, starred, preview, help). What do you think?

That sounds like a sensible approach to me.

I agree with your suggestion for "view clusters." What do you think? @antonin_d

I think the ability to preview clusters while improving the expression would be really great. Without this, people will likely need a lot of back-and-forth between the clustering dialog and the expression editor to tweak the expression to what they need.

In our weekly chat today @zyadtaha proposed having this in a tab alongside the classic expression preview tab. I think this could potentially work, but I can imagine there could be many other sensible designs. The potential problems I can think of are:

  • we might be a bit short of space: the preview tab is rather small, and while it works to show the results of an expression, displaying a cluster of values takes a bit more space. Ideally, the user should be able to get a sense of the shape of the clusters without having to do too much scrolling
  • in the case of a custom distance function, recomputing the clusters on the fly might be a bit expensive, since the expression needs to be evaluated on many pairs of values to form the clusters. We might need to have the user refresh the previews manually (like in the clustering dialog itself)
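To give a feel for the cost mentioned in the second point, this small Python sketch counts how many distance evaluations a naive all-pairs comparison would need. It assumes every pair of distinct values is compared exactly once, with no blocking or other shortcut; `pairwise_evaluations` is a hypothetical helper name.

```python
from math import comb

def pairwise_evaluations(n_distinct):
    """Number of unordered pairs among n_distinct values, i.e. n * (n - 1) / 2."""
    return comb(n_distinct, 2)

for n in (100, 1_000, 10_000):
    print(f"{n:>6} distinct values -> {pairwise_evaluations(n):>10,} evaluations")
```

Because the count grows quadratically, a column with ten times as many distinct values needs roughly a hundred times as many expression evaluations, which is why a manual refresh may be preferable to recomputing on every keystroke.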

Proposal design

  • The add button will be labeled "Add new keying function" for Keying functions, and "Add new distance function" for Distance functions.

What are your thoughts? @antonin_d, @ostephens