Cluster and edit facet automation

JacobHamelMottiez · July 13, 2024, 8:21pm

Hi,

I hope you are doing well!

I came across OpenRefine recently and it is very useful for the bibliometric work that I am currently pursuing (version used: OpenRefine 3.8.2).

More precisely, I am using the facet called "Cluster and edit" to find titles of articles/books in my corpus that are very similar and yet don't share the same ID (an identifier that tells whether two articles/books are the same or not). However, the number of clusters I get when using this facet is very large. I am able to process around 1,000 clusters at a time, and the total number of clusters is around 170,000. Manually repeating this operation over and over is very tedious and far from time-efficient.

Is there a way to automate this process, with a script for example? I am not well versed in programming, but I would like to achieve something that automatically 1) selects all the clusters, 2) merges all the selected clusters, and 3) repeats that operation until no clusters are left.

Do you think it is achievable? Any help will be appreciated.
Best,

Jacob

tfmorris · July 15, 2024, 11:12pm

There is a preference setting called ui.clustering.choices.limit which defaults to 5000 that you can try increasing (gradually!) if you think your browser can handle the load.

Are you reviewing the candidate clusters? Choosing which value you want to use for the cluster? The workflow is really designed for you to have the opportunity to do both of those things and 1,000 clusters is already a lot to review at one time.

To answer your direct question, no, there isn't a way to automate this, but perhaps increasing the limit will help if you don't care about reviewing the candidates.

Tom

JacobHamelMottiez · July 18, 2024, 9:14pm

Hi,

Thank you so much for your answer! I am not reviewing the clusters, because the clustering algorithm that I use is Fingerprint (which, to my knowledge, is the least prone to producing false positives). I will try to gradually increase the limit of the clustering choices.

Thanks again,
Jacob
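
For readers who want the unreviewed "select all, merge, repeat" workflow described above, here is a minimal sketch of how fingerprint-style key collision clustering could be done outside OpenRefine in plain Python. It is not OpenRefine's own code: `fingerprint` only approximates the Fingerprint keying method (trim, lowercase, fold accents, strip punctuation, sort and de-duplicate tokens), `merge_clusters` is a hypothetical helper name, and choosing the most frequent spelling as the merged value is an assumption made for illustration.

```python
import re
import unicodedata
from collections import Counter, defaultdict


def fingerprint(value: str) -> str:
    """Approximate a fingerprint key for a title (assumption: simplified rules)."""
    value = value.strip().lower()
    # Fold accented characters to their unaccented base letters.
    value = unicodedata.normalize("NFKD", value)
    value = "".join(ch for ch in value if not unicodedata.combining(ch))
    # Remove punctuation (anything that is neither a word character nor whitespace).
    value = re.sub(r"[^\w\s]", "", value)
    # Split on whitespace, de-duplicate the tokens, sort them, and rejoin.
    return " ".join(sorted(set(value.split())))


def merge_clusters(titles):
    """Replace every title in a multi-variant cluster with one canonical spelling."""
    groups = defaultdict(list)
    for title in titles:
        groups[fingerprint(title)].append(title)

    replacement = {}
    for members in groups.values():
        if len(set(members)) > 1:
            # Assumption: keep the most frequent variant; ties go to the first seen.
            canonical = Counter(members).most_common(1)[0][0]
            for member in members:
                replacement[member] = canonical
    return [replacement.get(title, title) for title in titles]


if __name__ == "__main__":
    sample = [
        "The Origin of Species",
        "the origin of species.",
        "Origin of the Species",
        "The Origin of Species",
        "Principia Mathematica",
    ]
    for before, after in zip(sample, merge_clusters(sample)):
        print(f"{before!r} -> {after!r}")
```

Because every title maps to exactly one key, the groups do not overlap, so a single pass merges everything and no "repeat until no clusters are left" loop is needed in this sketch. Note that token sorting also merges titles that differ only in word order, so spot-checking a sample of the merged groups is still worthwhile.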
