Cluster and edit facet automation

Hi,

I hope you are doing well!

I came across OpenRefine recently and it is very useful for the bibliometric work that I am currently pursuing (version used: OpenRefine 3.8.2).

More precisely, I am using the facet called "Cluster and edit" to find titles of articles/books in my corpus that are very similar and yet don't share the same ID (an identifier that tells whether two articles/books are the same or not). However, the number of clusters I get when using this facet is very large. I am able to process around 1,000 clusters at a time, and the total number of clusters is around 170,000 :sweat_smile:. Manually repeating this operation over and over is very tedious and far from time-efficient.

Is there a way to automate this process, with a script for example? I am not well versed in programming, but I would like to achieve something that automatically 1) selects all the clusters, 2) merges all the selected clusters, and 3) repeats that operation until no clusters are left.

Do you think it is achievable? Any help will be appreciated :smiley:
Best,

Jacob

There is a preference setting called ui.clustering.choices.limit, which defaults to 5000, that you can try increasing (gradually!) if you think your browser can handle the load.

Are you reviewing the candidate clusters? Choosing which value you want to use for the cluster? The workflow is really designed for you to have the opportunity to do both of those things and 1,000 clusters is already a lot to review at one time.

To answer your direct question, no, there isn't a way to automate this, but perhaps increasing the limit will help if you don't care about reviewing the candidates.
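
If you ever want to experiment outside of OpenRefine, the "select all, merge all" idea can be sketched in a few lines of Python. This is only a rough approximation of a fingerprint-style key collision method, not OpenRefine's own implementation: the function names `fingerprint` and `merge_clusters` are made up for illustration, and picking the most frequent spelling as the merged value is an assumption about what you would want.

```python
import string
import unicodedata
from collections import Counter, defaultdict


def fingerprint(value: str) -> str:
    """Approximate fingerprint key (hypothetical helper, not OpenRefine code)."""
    text = value.strip().lower()
    # Fold accented characters to their closest ASCII letters.
    text = unicodedata.normalize("NFKD", text)
    text = text.encode("ascii", "ignore").decode("ascii")
    # Remove ASCII punctuation.
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Split on whitespace, drop duplicate tokens, sort, and rejoin.
    return " ".join(sorted(set(text.split())))


def merge_clusters(titles: list[str]) -> list[str]:
    """Replace every title in a cluster with one representative spelling."""
    groups = defaultdict(list)
    for title in titles:
        groups[fingerprint(title)].append(title)

    replacement = {}
    for members in groups.values():
        if len(set(members)) > 1:  # a cluster needs at least two distinct values
            # Assumption: keep the most frequent spelling (first seen wins ties).
            canonical = Counter(members).most_common(1)[0][0]
            for member in members:
                replacement[member] = canonical

    return [replacement.get(title, title) for title in titles]


titles = [
    "The Origin of Species",
    "the origin of species.",
    "Origin of the Species",
    "Other Book",
]
merged = merge_clusters(titles)
```

With a key collision method like this, a single pass is enough: every value in a cluster shares the same key, so once they are merged into one spelling no cluster remains for that key. Keep in mind that this merges blindly, with no chance to review the candidates.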

Tom

Hi,

Thank you so much for your answer! I am not reviewing the clusters, since the clustering algorithm that I use is Fingerprint (which, to my knowledge, is the least prone to producing false positives). I will try gradually increasing the clustering choices limit.

Thanks again :slight_smile:,
Jacob