Data Cleaning and Transformation Automation and Performance

Hi Everyone,

I am working on a bibliometric analysis and normalizing bibliometric data across more than 34,000 rows of References data. Each References entry comprises 100-256 characters. I am using OpenRefine 3.7.2 with 64-bit Java on a 64-bit Intel platform running Windows 11 Desktop or Windows Server 2022.

I am using the nearest neighbor method with the Levenshtein/PPM distance functions to find potential duplicates and merge those References entries, since it seems to detect duplicates better for long reference strings (I think :slight_smile: ).

My challenge is that it requires a lot of user intervention and takes a very long time to process. Is there any way for me to automate it and increase the processing performance?

What I meant by automating:

  1. Is there any script I can use to automate this: when the analysis completes, OpenRefine automatically selects ALL the duplicates, merges them, and re-clusters until no duplicates remain? This would also include changing the distance function from Levenshtein to PPM, starting the analysis, and automating the merging and re-clustering until there are no duplicates.
  2. Is there any way to start with the nearest neighbor method and the PPM distance function without having to run Levenshtein first? Currently I can only use PPM after running Levenshtein: when I change the method to nearest neighbor, there is no option to select PPM until I have completed the Levenshtein analysis.

What I meant by improving performance:
It takes hours (24+ hours) to process the data, but looking at the processor utilization, it might not be an efficient process … the utilization spikes to 100% and then slowly drops to 0% when one job completes. As I am renting CPU power hourly from AWS, this becomes a problem, since most of the time my server has low utilization over long hours of running. Any suggestion on the configuration, or on how to process such a large amount of data more efficiently and so reduce the hours of rent?

What I am doing right now is splitting the database using filters and running separate instances, as listed in the picture. I have not yet explored how the algorithm works, but would it be a better solution to simply rent a larger number of vCPUs and run everything at the same time instead of splitting the data?

Is there any better way to run this kind of processing? Any pointers or directions I can explore to find answers?

Thank you.

Hi @putomo,

Thanks for sharing your use case with us! Here are some thoughts on the questions and points you raise.

Is there any script I can use to automate this: when the analysis completes, OpenRefine automatically selects ALL the duplicates, merges them, and re-clusters until no duplicates remain? This would also include changing the distance function from Levenshtein to PPM, starting the analysis, and automating the merging and re-clustering until there are no duplicates.

I am not aware of a simple way to do this in OpenRefine directly. The user interface is built around the assumption that you normally want to review clustering candidates before merging them.
Intuitively, this seems to be a good use case for external tools which automate OpenRefine. We list some of them on our website. For instance, you might be able to do this with openrefine-client, but I have not checked.
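
To give a concrete idea of what such an automation could look like, here is a rough sketch driven over HTTP against the same backend commands that the OpenRefine UI itself calls. This is only a sketch: the endpoint names, parameter names and response shapes are assumptions that you would need to verify against your OpenRefine version (openrefine-client wraps similar calls for you).

```python
# Hedged sketch of a cluster -> merge -> re-cluster loop over OpenRefine's HTTP
# backend. Endpoints, parameters and response shapes are assumptions to verify.
import json
import requests

BASE = "http://localhost:3333"   # assumed OpenRefine host/port
PROJECT = "1234567890"           # assumed project id
COLUMN = "References"            # assumed column name
ENGINE = '{"facets":[],"mode":"row-based"}'

def csrf_token():
    return requests.get(f"{BASE}/command/core/get-csrf-token").json()["token"]

def compute_clusters(function):
    """Ask OpenRefine for kNN clusters using the given distance function."""
    clusterer = {"type": "knn", "function": function, "column": COLUMN,
                 "params": {"radius": 1, "blocking-ngram-size": 6}}
    r = requests.post(f"{BASE}/command/core/compute-clusters",
                      params={"csrf_token": csrf_token()},
                      data={"project": PROJECT, "engine": ENGINE,
                            "clusterer": json.dumps(clusterer)})
    # Assumed response: a list of clusters, each a list of {"v": value, "c": count}
    return r.json()

def merge_all(clusters):
    """Merge every cluster to its first value (the UI would let you choose)."""
    edits = [{"from": [m["v"] for m in cluster], "fromBlank": False,
              "fromError": False, "to": cluster[0]["v"]} for cluster in clusters]
    requests.post(f"{BASE}/command/core/mass-edit",
                  params={"csrf_token": csrf_token()},
                  data={"project": PROJECT, "engine": ENGINE,
                        "columnName": COLUMN, "expression": "value",
                        "edits": json.dumps(edits)})

# Run Levenshtein first, then PPM, merging and re-clustering until nothing is left.
for function in ("levenshtein", "ppm"):
    while True:
        clusters = compute_clusters(function)
        if not clusters:
            break
        merge_all(clusters)
```

Merging everything blindly like this obviously skips the review step that the UI encourages, so I would test it on a copy of the project first.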

Is there any way to start with the nearest neighbor method and the PPM distance function without having to run Levenshtein first? Currently I can only use PPM after running Levenshtein: when I change the method to nearest neighbor, there is no option to select PPM until I have completed the Levenshtein analysis.

I guess we could make it so that the dialog opens with the last-used settings. We have done this for other dialogs, by storing some of their settings in the preferences (see #2187 for instance). It could be worth opening an issue about this. Would you like to do it?

It takes hours (24+ hours) to process the data, but looking at the processor utilization, it might not be an efficient process … the utilization spikes to 100% and then slowly drops to 0% when one job completes. As I am renting CPU power hourly from AWS, this becomes a problem, since most of the time my server has low utilization over long hours of running. Any suggestion on the configuration, or on how to process such a large amount of data more efficiently and so reduce the hours of rent?

There has been a proposal to integrate a more efficient algorithm for nearest neighbors, but sadly the first attempt failed because the underlying library was not packaged properly. It would be an interesting project to try to integrate it again (the algorithm is publicly described and there are open source implementations around, apparently).

I also have it on my roadmap to work on improving clustering in the 4.0 version, which is designed to work on larger datasets. Sadly I do not think the work I will do there will solve your problem, since those improvements are about better memory management. Looking at your screenshot you are dealing with about 34,000 rows. This normally fits easily in memory: the problem is the asymptotic complexity (quadratic in the number of rows) of our nearest neighbors implementation, and perhaps the length of the strings you give it too.
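
To give a rough sense of the scale, here is a back-of-the-envelope illustration of that quadratic growth. These numbers are only the worst case of comparing every pair; the n-gram blocking discussed later in this thread reduces them, but the growth within each block stays quadratic.

```python
# Back-of-the-envelope count of pairwise comparisons, worst case (no blocking).
rows = 34_000
print(f"{rows * (rows - 1) // 2:,}")    # 577,983,000 comparisons for the full set

batch = 6_000
print(f"{batch * (batch - 1) // 2:,}")  # 17,997,000 comparisons per 6,000-row batch
```

That is why splitting the data into ~6,000-row batches, as you already do, cuts the total work by roughly a factor of five, at the cost of never comparing references that land in different batches.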

Hi Prio,

I am working on a bibliometric analysis and normalizing bibliometric data across more than 34,000 rows of References data. Each References entry comprises 100-256 characters.

Sounds like an interesting project!

  1. Is there any way to start with the nearest neighbor method and the PPM distance function without having to run Levenshtein first? Currently I can only use PPM after running Levenshtein: when I change the method to nearest neighbor, there is no option to select PPM until I have completed the Levenshtein analysis.

That feature was first requested in 2010. You can track progress on the work using issue #241.

It takes hours (24+ hours) to process the data, but looking at the processor utilization, it might not be an efficient process … the utilization spikes to 100% and then slowly drops to 0% when one job completes.

How many cores are you using? How long does a single clustering operation take? Are you working on all 34,000 rows or just a subset (e.g. just the 1259 rows beginning with XYZ, as shown in your screen capture)? I'd be curious to see what the shape of the CPU utilization curve looks like. The KNN clusterer uses a home-grown multithreaded implementation which long predates modern Java thread support, and it does a fixed partitioning of the work, which may not be optimal (but at least it's multithreaded, unlike many parts of OpenRefine).

One thing that you can control with the current implementation is the size of the blocks, by varying the n-gram size (longer n-grams == more, smaller blocks). Measuring pairwise distances within a block is quadratic in the size of the block, so you want smaller blocks for efficiency (but bigger blocks for the best chance of not missing any match candidates).
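
To illustrate that trade-off, here is a simplified sketch of n-gram blocking in general (not OpenRefine's exact implementation): only values that share at least one n-gram end up in the same block, and only pairs within a block get a distance computed.

```python
from collections import defaultdict
from itertools import combinations

def ngrams(value, n):
    """All character n-grams of a value (lowercased, spaces removed)."""
    s = "".join(value.lower().split())
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def block_by_ngrams(values, n):
    """Group values into blocks keyed by the n-grams they contain."""
    blocks = defaultdict(set)
    for v in values:
        for g in ngrams(v, n):
            blocks[g].add(v)
    return blocks

def candidate_pairs(blocks):
    """Pairs sharing at least one block; only these would get a distance computed."""
    pairs = set()
    for members in blocks.values():
        pairs.update(combinations(sorted(members), 2))
    return pairs

refs = [
    "Smith J. (2020) Title A",
    "Smith J (2020) Title A.",
    "Doe A. (2019) Another subject B",
]
for n in (2, 6):
    print(n, len(candidate_pairs(block_by_ngrams(refs, n))))
# With these toy strings: n=2 keeps all 3 pairs, n=6 keeps only the Smith/Smith pair.
```

With a larger n, values must share a longer substring to land in the same block, so there are fewer candidate pairs to measure, at the risk of missing a genuine near-duplicate that shares no n-gram of that length.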

Is there any better way to run this kind of processing? Any pointers or directions I can explore to find answers?

You are using a very general string matching algorithm to process what are actually quite structured strings, so I suspect you could probably do better with something tuned specifically for measuring the "distance" between two citations, but that would be a non-OpenRefine solution (or an extension).
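
For instance, here is a minimal sketch of what a citation-aware distance could look like, just to show the idea; the field extraction and weights below are purely illustrative assumptions, not an existing OpenRefine feature.

```python
import re
from difflib import SequenceMatcher

def parse_citation(ref):
    """Naive split of a reference string; a real solution would parse authors,
    journal, volume, pages, etc. with a proper citation parser."""
    year = re.search(r"\b(19|20)\d{2}\b", ref)
    return {
        "year": year.group(0) if year else "",
        "text": re.sub(r"\W+", " ", ref).lower().strip(),
    }

def citation_distance(a, b):
    """Weighted distance: a penalty for mismatched years plus a fuzzy text distance."""
    pa, pb = parse_citation(a), parse_citation(b)
    year_penalty = 0.0 if pa["year"] == pb["year"] else 1.0
    text_distance = 1.0 - SequenceMatcher(None, pa["text"], pb["text"]).ratio()
    return 0.5 * year_penalty + 0.5 * text_distance

print(citation_distance(
    "Smith, J. (2020). A study of things. Journal of Stuff, 12(3), 45-67.",
    "Smith J. A study of things. J. Stuff 12 (2020) 45-67.",
))
```

If you went that route, such a distance could live either in an OpenRefine extension or in a standalone script that writes a canonical reference back into your project.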

Best regards,
Tom

Thank you @antonin_d for the insights.

In regard to more efficient processing, I do not know whether there has been an attempt to use the processor more efficiently by running several OpenRefine instances. I attach an “illustration” of what the CPU utilization looks like during my process. Basically, I want to make use of the low-utilization periods, which last much longer than the peak.

Basically, I am trying to find the balance between the number of vCPUs, the processing time, and ultimately the cost, as AWS EC2 charges hourly … I tested EC2 instances with 4, 8, 16, 32, and 48 vCPUs: the more vCPUs, the less time the server spends in the peak 100% state, which means more time in an inefficient state (CPU utilization below 100%).

Hi @tfmorris, thank you for the direction.

I posted the illustration of CPU utilization in my reply to @antonin_d, and I tested 4, 8, 16, 32, and 48 vCPUs just to see the behavior. A higher number of vCPUs is IMHO less efficient, as the peak utilization only happens for a very short period of time and the rest is lower CPU utilization, which is bad for an IaaS subscription as it means more hours to rent :slight_smile:

Initially I put all references in the calculation (starting with A-Z). Not knowing how the algorithm works, I assumed it would compare all records, which would probably take weeks. With a low number of vCPUs, the system sometimes hangs.

So I decided to split the references data alphabetically into separate processing batches, taking the number of rows into consideration, e.g.:
batch 1: all references starting with the letter A, around 6,000 records in total
batch 2: all references starting with the letters B and C, around 6,000 records in total
batch 3: all references starting with the letters V, W, X, Y, and Z, around 6,000 records in total

In this way I can avoid the system hanging, and I know the processing is running. The problem is that it is all manual, so there were many times when I missed the estimated time to start the next batch…

I think OpenRefine is a powerful solution; I just need to find a way to schedule the processing batches so that the start of the next batch is automated (see the rough sketch after this list), which would solve two of my problems:

  1. CPU utilization, as I can estimate when to start the next batch, e.g. 45 minutes after the first batch started.
  2. Breaking the processing down into separate batches (subsets of the data) to make sure the server does not hang.
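
Something like the following is what I have in mind, just as a rough sketch: it assumes each batch is driven by some run_batch() function (not shown here) and simply staggers the start times so the CPU stays busier.

```python
# Rough sketch: start each alphabetical batch a fixed delay after the previous
# one, so a new batch ramps up while the earlier one is in its low-CPU phase.
import threading

BATCHES = ["A", "B-C", "D-F", "V-Z"]   # illustrative subsets, ~6,000 rows each
STAGGER_MINUTES = 45                   # estimated offset between batch starts

def run_batch(name):
    print(f"starting batch {name}")
    # ... drive OpenRefine for this subset here (not shown) ...

timers = []
for i, name in enumerate(BATCHES):
    t = threading.Timer(i * STAGGER_MINUTES * 60, run_batch, args=(name,))
    t.start()
    timers.append(t)
for t in timers:
    t.join()
```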