Record Linkage and Entity Resolution

dqlearner · August 20, 2023, 9:44pm

I noticed that OpenRefine would be more complete as a data management and cleaning tool if we could implement functions for two common data management/cleansing problems, taking advantage of its powerful clustering algorithms:
- Entity Resolution or Deduplication
- Record Linkage between two datasets

olea · August 23, 2023, 5:43pm

Interesting. Could you tell us more about this?

dqlearner · November 16, 2023, 11:55pm

Entity Resolution or Deduplication: This means detecting and grouping (clustering) duplicate records, for example in a customer table where the same customer is written in different forms. The current clustering method works great when we want to compare a single column, but not multiple columns (Name, Address, City, etc.).
Also, the current method does not allow exporting the clustering result. It would be useful if it could group the clusters, give each of them a number, and export the result as CSV for managing in Excel later.
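
To illustrate the idea outside OpenRefine, here is a minimal Python sketch, not OpenRefine's implementation. It assumes a simplified fingerprint key (lowercase, punctuation removed, unique tokens sorted) applied to each column, treats rows as duplicates only when the keys match on every listed column, numbers the clusters, and writes the result as CSV. The function names and the sample customer rows are made up for the example.

```python
import csv
import io
import string
from collections import defaultdict


def fingerprint(value: str) -> str:
    """Simplified fingerprint key: lowercase, drop punctuation, sort unique tokens."""
    cleaned = value.strip().lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(sorted(set(cleaned.split())))


def cluster_rows(rows, columns):
    """Group rows whose fingerprint keys collide on every listed column."""
    groups = defaultdict(list)
    for row in rows:
        key = tuple(fingerprint(row[col]) for col in columns)
        groups[key].append(row)
    clustered = []
    # Number clusters in order of first appearance and tag each row with its number.
    for cluster_id, members in enumerate(groups.values(), start=1):
        for row in members:
            clustered.append({"cluster_id": cluster_id, **row})
    return clustered


def to_csv(rows) -> str:
    """Serialize the clustered rows as CSV text (write it to a file to open in Excel)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical sample data: the first two rows are the same customer written differently.
customers = [
    {"Name": "Smith, John", "Address": "12 Main St.", "City": "Boston"},
    {"Name": "John Smith", "Address": "12 main st", "City": "boston"},
    {"Name": "Jane Doe", "Address": "5 Oak Ave", "City": "Boston"},
]
clustered = cluster_rows(customers, ["Name", "Address", "City"])
print(to_csv(clustered))
```

Exact key collision only catches differences in case, punctuation, and word order. Misspellings would need a distance-based comparison instead.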

Record Linkage between two datasets: This is basically the concept of data reconciliation. What I really mean is matching similar records between two datasets:
Customer table A and Customer table B, returning the matching results with a matching score.
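
As a rough sketch of what such a result could look like, the Python below compares every record in table A with every record in table B and keeps the best match per record together with its score. It is only an illustration under assumptions: the score is the average of per-column `difflib.SequenceMatcher` ratios, the 0.8 threshold is arbitrary, and `link_records` and the sample tables are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity between 0.0 and 1.0."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def link_records(table_a, table_b, columns, threshold=0.8):
    """Return (index_a, index_b, score) for the best match of each record in table_a."""
    matches = []
    for i, rec_a in enumerate(table_a):
        best_index, best_score = None, 0.0
        for j, rec_b in enumerate(table_b):
            # Matching score: mean similarity over the compared columns.
            score = sum(similarity(rec_a[c], rec_b[c]) for c in columns) / len(columns)
            if score > best_score:
                best_index, best_score = j, score
        if best_index is not None and best_score >= threshold:
            matches.append((i, best_index, round(best_score, 3)))
    return matches


# Hypothetical sample tables.
table_a = [
    {"Name": "John Smith", "City": "Boston"},
    {"Name": "Maria Garcia", "City": "Madrid"},
]
table_b = [
    {"Name": "Maria Garcia", "City": "Madrid"},
    {"Name": "Jon Smith", "City": "Boston"},
    {"Name": "Peter Jones", "City": "Leeds"},
]
links = link_records(table_a, table_b, ["Name", "City"])
for a_idx, b_idx, score in links:
    print(table_a[a_idx]["Name"], "<->", table_b[b_idx]["Name"], score)
```

Comparing every pair is quadratic in the table sizes, so larger tables would need some form of blocking (for example, only comparing records that share a city) before scoring.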

These two functions would give a great boost to the data management use case.

Martin · November 21, 2023, 8:09pm

@dqlearner, thanks for the details.

Entity Resolution or Deduplication
You can export the cluster results in JSON for a separate analysis (see my screenshot below).

It could be interesting to see how clustering on multiple columns could work. The current workaround is to concatenate the columns, but we are losing some granularity with this approach. I think this is something worth exploring further.
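
To make the granularity point concrete, here is a small Python sketch (outside OpenRefine, using `difflib.SequenceMatcher` as a stand-in similarity measure, which is an assumption and not what OpenRefine's clustering uses). It compares two made-up records once as concatenated strings and once column by column.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


# Hypothetical records: same name, different city.
rec_1 = {"Name": "John Smith", "City": "Boston"}
rec_2 = {"Name": "John Smith", "City": "Austin"}
columns = ["Name", "City"]

# Workaround: concatenate the columns and compare one long string.
concatenated = similarity(
    " ".join(rec_1[c] for c in columns),
    " ".join(rec_2[c] for c in columns),
)

# Alternative: keep one score per column so each can be judged separately.
per_column = {c: similarity(rec_1[c], rec_2[c]) for c in columns}

print("concatenated:", round(concatenated, 2))
print("per column:", {c: round(s, 2) for c, s in per_column.items()})
```

The concatenated score stays high because the identical name dominates the string, while the per-column view shows that the city barely matches. With separate scores, a rule such as "name and city must both be similar" becomes possible.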

Regarding Record Linkage, OpenRefine's support for reconciliation services should help a lot. If you want to reconcile against your own dataset, you can start by looking at a CSV-based reconciliation service.
