2025 Barcamp Session Proposal: New Customized Clustering Feature since 3.9

Description

Since OpenRefine 3.9 we have a new feature that enables us to define our own clustering functions.

Since the release there has not been much talk about this feature, and there have been no user questions in the forum. In fact, the feature was broken in its first release and nobody noticed for quite some time.

As I personally really enjoy this feature, I suggest giving a short presentation on what can be done with it and then discussing what is missing.

Format

Guided workshop. Probably 30-45 minutes long.

Session goals

Learn about this new feature and identify use cases.

The shared Etherpad with the notes from the call is available here: Etherpad

Cleaned-up notes from the pad

Custom clustering functions

The new clustering feature introduced in OpenRefine 3.9 allows users to define custom clustering functions. It is possible to combine several clustering algorithms into a single function to speed up the process, for example:

value.fingerprint().ngramFingerprint()
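To make the effect of chaining the two keying functions above easier to see outside OpenRefine, here is a minimal Python sketch. The functions `fingerprint`, `ngram_fingerprint` and `combined_key` are simplified approximations written for this note, not OpenRefine's actual implementation, and the n-gram size of 2 is an assumption.

```python
import re
import string
import unicodedata

# Matches any ASCII punctuation character.
PUNCT = re.compile("[%s]" % re.escape(string.punctuation))


def to_ascii(text):
    # Decompose accented characters, then drop everything that is not ASCII.
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def fingerprint(text):
    # Simplified fingerprint: trim, lowercase, strip accents and punctuation,
    # then sort and de-duplicate the whitespace-separated tokens.
    cleaned = PUNCT.sub("", to_ascii(text.strip().lower()))
    return " ".join(sorted(set(cleaned.split())))


def ngram_fingerprint(text, n=2):
    # Simplified n-gram fingerprint: remove punctuation and whitespace,
    # then sort and de-duplicate all character n-grams.
    cleaned = re.sub(r"\s+", "", PUNCT.sub("", to_ascii(text.lower())))
    grams = {cleaned[i:i + n] for i in range(len(cleaned) - n + 1)}
    return "".join(sorted(grams))


def combined_key(text):
    # Chain both keyers: the n-gram fingerprint of the fingerprint.
    return ngram_fingerprint(fingerprint(text))
```

With this sketch, "Müller, Hans" and "hans  muller" end up with the same combined key, so a key-collision method would put them in one cluster.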

Custom clustering functions can also use standard expressions. For example, the replace() function can remove terms that may interfere with clustering. Example use case: removing common words such as “Place” or “Street” when clustering place names.
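The place-name use case can be sketched in plain Python as a key-collision grouping. Everything here is illustrative: `STOP_WORDS`, `place_key`, `cluster` and the sample names are made up for this note, and the keying is deliberately simpler than what OpenRefine does.

```python
import re
from collections import defaultdict

# Hypothetical list of terms to ignore when comparing place names.
STOP_WORDS = {"place", "street"}


def place_key(name):
    # Lowercase, split into word tokens, drop the stop words,
    # then sort and de-duplicate what is left.
    tokens = re.findall(r"\w+", name.lower())
    kept = [t for t in tokens if t not in STOP_WORDS]
    return " ".join(sorted(set(kept)))


def cluster(values):
    # Key collision: values that share a key land in the same group.
    groups = defaultdict(list)
    for value in values:
        groups[place_key(value)].append(value)
    # Only groups with more than one member are interesting as clusters.
    return [group for group in groups.values() if len(group) > 1]


names = ["Baker Street", "Baker", "baker street", "Park Place", "Elm Street"]
clusters = cluster(names)
```

One thing to keep in mind with this approach: dropping such words also makes "Baker Street" and "Baker Place" collide, so the resulting clusters still need a manual check.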

Related tutorials and resources

Examples and tutorials related to clustering in German:

Extending clustering with external services

Participants also discussed calling external services to expand functionality. One approach is to call external functions via FastAPI using Jython in OpenRefine. For example

These services can extend clustering or matching capabilities beyond what is available directly in OpenRefine.
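As a rough idea of what the client side of such a call could look like, here is a sketch that uses only the standard library and is written to run on both Python 3 and the Python 2.7 dialect that Jython provides. The service URL, the `q` query parameter and the `key` field in the JSON response are all assumptions; the real service discussed in the call may look different.

```python
import json

try:  # Python 3
    from urllib.parse import urlencode
    from urllib.request import urlopen
except ImportError:  # Python 2.7, as provided by Jython
    from urllib import urlencode
    from urllib2 import urlopen

# Hypothetical endpoint of a locally running FastAPI service.
SERVICE_URL = "http://localhost:8000/normalize"


def build_url(value, base_url=SERVICE_URL):
    # Encode the cell value as the (assumed) "q" query parameter.
    return base_url + "?" + urlencode({"q": value.encode("utf-8")})


def parse_key(body):
    # Assumes the service answers with JSON such as {"key": "..."}.
    return json.loads(body)["key"]


def remote_key(value, timeout=5):
    # Call the service and return the key it computed for this value.
    response = urlopen(build_url(value), timeout=timeout)
    try:
        return parse_key(response.read().decode("utf-8"))
    finally:
        response.close()
```

Inside an OpenRefine Jython expression the last line would then be something like `return remote_key(value)`. Since this makes one HTTP request per value, it is probably only practical for small or pre-filtered columns.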

Documentation ideas

Thad suggested adding a “Guidelines” subsection in the clustering documentation to explain cases where clustering is not recommended.

There was also a suggestion to publish related material in Programming Historian, which would make it easier to translate tutorials and adapt examples for other languages. Example lesson referenced: “Clustering with Scikit-Learn in Python” in Programming Historian (see also the Portuguese translation).

Visualizing clustering

Participants discussed possible ways to visualize how clusters are formed.

Silvia shared a Sankey diagram illustrating how different variants are merged into a single value:

Thad suggested that chord diagrams could be an alternative way to visualize clustering results.
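Both diagram types can be drawn from the same kind of data: weighted links from each original variant to the value it was merged into. The sketch below shows one way to derive such links. The function `sankey_links` and the sample rows are invented for illustration and are not taken from the diagram shared in the call.

```python
from collections import Counter


def sankey_links(rows):
    # rows holds one (original value, merged value) pair per cell.
    # Each distinct pair becomes one link, weighted by how often it occurs.
    counts = Counter(rows)
    return sorted((src, dst, n) for (src, dst), n in counts.items())


# Made-up example: cell values before and after merging a cluster.
rows = [
    ("Berlin", "Berlin"),
    ("berlin", "Berlin"),
    ("BERLIN ", "Berlin"),
    ("berlin", "Berlin"),
    ("Bonn", "Bonn"),
]
links = sankey_links(rows)
```

Each resulting tuple is a (source, target, weight) triple, which is the general shape that Sankey and chord diagram tools tend to expect as input.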