2025 Barcamp Session Proposal: New Customized Clustering Feature since 3.9

b2m · September 8, 2025, 3:26pm

Description

Since OpenRefine 3.9 we have a new feature that enables us to define our own clustering functions.

Since the release there has not been a lot of talk about this feature, and there have been no user questions in the forum. In fact, the feature was broken in its first release and nobody noticed for quite some time.

As I personally really enjoy this feature, I suggest giving a short presentation on what can be done with it and then discussing what is missing.

Format

Guided workshop. Probably 30-45 minutes long.

Session goals

Learn about this new feature and identify use cases.

Martin · September 9, 2025, 6:46pm

The shared Etherpad with the notes from the call is available here: Etherpad

Martin · March 12, 2026, 8:37pm

Cleaned-up notes from the pad:

Custom clustering functions

The new clustering feature introduced in OpenRefine 3.9 allows users to define custom clustering functions. It is possible to combine several clustering algorithms into a single function to speed up the process, for example:

value.fingerprint().ngramFingerprint(2)

Custom clustering functions can also use standard expressions. For example, the replace() function can remove terms that may interfere with clustering. Example use case: removing common words such as “Place” or “Street” when clustering place names.
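To illustrate the idea outside OpenRefine, here is a minimal Python sketch of key-collision clustering with a custom keying function that drops such terms before building a fingerprint-style key. It is not OpenRefine's implementation; the names NOISE_WORDS, keyer and cluster and the sample values are made up for this example.

```python
import re
import string
from collections import defaultdict

# Hypothetical list of terms to ignore; adjust it to the data at hand.
NOISE_WORDS = {"place", "street"}


def keyer(value):
    """Fingerprint-style key: lowercase, strip punctuation, drop noise
    words, then sort and de-duplicate the remaining tokens."""
    cleaned = value.strip().lower()
    cleaned = re.sub("[%s]" % re.escape(string.punctuation), "", cleaned)
    tokens = [t for t in cleaned.split() if t not in NOISE_WORDS]
    return " ".join(sorted(set(tokens)))


def cluster(values):
    """Group values that share a key; keep only groups with several variants."""
    groups = defaultdict(set)
    for v in values:
        groups[keyer(v)].add(v)
    return {k: sorted(vs) for k, vs in groups.items() if len(vs) > 1}


# Made-up sample values for illustration only.
names = ["Baker Street", "baker st.", "Baker", "Street Baker", "Elm Place", "Elm"]
clusters = cluster(names)
```

With this keyer, "Baker Street", "Street Baker" and "Baker" share one key, while "baker st." stays separate because "st" is not in the noise list. This also shows why the list of removed terms needs care: removing too much can merge values that are actually different places.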

Extending clustering with external services

Participants also discussed calling external services to expand functionality. One approach is to expose external functions through a FastAPI service and call that service from Jython in OpenRefine.

These services can extend clustering or matching capabilities beyond what is available directly in OpenRefine.
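The notes do not record the code that was shown, so the following is only a rough sketch of what the calling side could look like. It assumes a hypothetical service at http://localhost:8000/normalize that accepts JSON of the form {"value": ...} and answers with {"key": ...}; the URL, the field names and the function names are all assumptions. The import fallback is there because OpenRefine's Jython is Python 2.7, which provides urllib2 instead of urllib.request.

```python
import json

try:  # Python 3
    from urllib.request import Request, urlopen
except ImportError:  # Jython / Python 2.7, as bundled with OpenRefine
    from urllib2 import Request, urlopen

# Hypothetical endpoint; the notes do not name the actual service.
SERVICE_URL = "http://localhost:8000/normalize"


def build_request(value, url=SERVICE_URL):
    """Build a POST request carrying the cell value as JSON."""
    payload = json.dumps({"value": value}).encode("utf-8")
    return Request(url, data=payload,
                   headers={"Content-Type": "application/json"})


def remote_key(value, url=SERVICE_URL, opener=urlopen, timeout=5):
    """Send the value to the service and return the key it computes.

    The "key" response field is an assumption about the service.
    `opener` can be swapped out, for example to test without a network.
    """
    response = opener(build_request(value, url), timeout=timeout)
    try:
        body = json.loads(response.read().decode("utf-8"))
    finally:
        response.close()
    return body["key"]
```

In an OpenRefine Jython expression, the body would end with something like `return remote_key(value)`. Each value triggers one HTTP request, so this is likely to be slow on large columns unless the service or the expression caches results.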

Documentation ideas

Thad suggested adding a “Guidelines” subsection in the clustering documentation to explain cases where clustering is not recommended.

There was also a suggestion to publish related material in Programming Historian, which would make it easier to translate tutorials and adapt examples for other languages. Example lesson referenced: Clustering with Scikit-Learn in Python | Programming Historian (see also the Portuguese translation)

Visualizing clustering

Participants discussed possible ways to visualize how clusters are formed.

Silvia shared a Sankey diagram illustrating how different variants are merged into a single value:

[Album] imgur.com

Thad suggested that chord diagrams could also be an alternative way to visualize clustering results.
