I noticed that OpenRefine would be more complete as a data management and cleaning tool if we could implement functions that solve common data management/cleansing problems by taking advantage of its powerful clustering algorithms:
-Entity Resolution or Deduplication
-Record Linkage between two datasets
Interesting. Could you tell more about this?
Entity Resolution or Deduplication: This means detecting and grouping (clustering) duplicate records, for example the same customer written in different forms in a customer table. The current clustering method works great when we want to compare a single column, but not multiple columns (Name, Address, City, etc.).
Also, the current method does not allow exporting the clustering result. It would be useful if it could group the clusters, give each of them a number, and export the result as CSV for managing in Excel later.
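To make the request concrete, here is a minimal Python sketch of what I mean, outside of OpenRefine. It assumes a simplified version of the fingerprint keying idea (lowercase, strip accents and punctuation, sort unique tokens), not OpenRefine's actual implementation. The function names, the column names and the `cluster_id` column are all made up for illustration.

```python
import csv
import io
import re
import unicodedata


def fingerprint(value: str) -> str:
    """Simplified fingerprint key (an approximation, not OpenRefine's code)."""
    text = unicodedata.normalize("NFKD", value.strip().lower())
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^\w\s]", "", text)  # drop punctuation
    return " ".join(sorted(set(text.split())))  # sorted unique tokens


def assign_cluster_ids(rows, columns):
    """Rows whose keys match on every listed column share a cluster_id."""
    ids = {}
    clustered = []
    for row in rows:
        key = tuple(fingerprint(row[c]) for c in columns)
        cluster_id = ids.setdefault(key, len(ids) + 1)
        clustered.append({**row, "cluster_id": cluster_id})
    return clustered


def to_csv(rows) -> str:
    """Serialize the rows (including cluster_id) as CSV text."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


# Hypothetical sample data
customers = [
    {"name": "John Smith", "address": "12 Main St.", "city": "Boston"},
    {"name": "Smith, John", "address": "12 main st", "city": "boston"},
    {"name": "Jane Doe", "address": "5 Oak Ave", "city": "Austin"},
]
clustered = assign_cluster_ids(customers, ["name", "address", "city"])
print(to_csv(clustered))
```

The first two rows end up in the same cluster and the third in its own, and the CSV can be opened directly in Excel. A real feature would probably need fuzzier matching than exact key equality.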
Record Linkage between two datasets: This is basically the concept of data reconciliation. But what I really mean is matching similar records between two datasets,
for example Customer table A and Customer table B, and getting the matching results with a matching score.
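As a rough illustration of the kind of output I have in mind, here is a small Python sketch. It uses `difflib.SequenceMatcher` from the standard library as a stand-in similarity measure, which is my own choice and not something OpenRefine uses. The averaging of per-column scores, the `threshold` value and the table contents are also assumptions.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """String similarity in [0, 1] after trimming and lowercasing."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def link_records(table_a, table_b, columns, threshold=0.8):
    """For each record in table_a, keep its best match in table_b.

    The score is the plain average of per-column similarities (an assumption);
    pairs scoring below the threshold are dropped.
    """
    matches = []
    for i, rec_a in enumerate(table_a):
        best_j, best_score = None, 0.0
        for j, rec_b in enumerate(table_b):
            score = sum(similarity(rec_a[c], rec_b[c]) for c in columns) / len(columns)
            if score > best_score:
                best_j, best_score = j, score
        if best_j is not None and best_score >= threshold:
            matches.append((i, best_j, round(best_score, 3)))
    return matches


# Hypothetical sample tables
table_a = [
    {"name": "John Smith", "city": "Boston"},
    {"name": "Jane Doe", "city": "Austin"},
]
table_b = [
    {"name": "Jon Smith", "city": "Boston"},
    {"name": "Maria Garcia", "city": "Madrid"},
]

matches = link_records(table_a, table_b, ["name", "city"])
for a_idx, b_idx, score in matches:
    print(f"A[{a_idx}] <-> B[{b_idx}] score={score}")
```

This compares every pair of records, so it only makes sense for small tables. Real record linkage tools usually add a blocking step to cut down the number of comparisons.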
These two functions would give a great boost to the data management use case.
@dqlearner, thanks for the details.
Entity Resolution or Deduplication
You can export the cluster results in JSON for a separate analysis (see my screenshot below).
It could be interesting to see how clustering on multiple columns could work. The current workaround is to concatenate the columns, but we are losing some granularity with this approach. I think this is something worth exploring further.
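To illustrate the granularity issue, here is a small Python sketch comparing a key built on the concatenated columns with one key per column. The `fingerprint` function is a simplified approximation of the fingerprint keying idea, not OpenRefine's implementation, and the rows and helper names are made up.

```python
import re


def fingerprint(value: str) -> str:
    """Simplified fingerprint: lowercase, drop punctuation, sort unique tokens."""
    text = re.sub(r"[^\w\s]", "", value.strip().lower())
    return " ".join(sorted(set(text.split())))


def concatenated_key(row, columns):
    """Workaround: join the columns first, then compute a single key."""
    return fingerprint(" ".join(row[c] for c in columns))


def per_column_key(row, columns):
    """Alternative: one key per column, compared as a tuple."""
    return tuple(fingerprint(row[c]) for c in columns)


# Hypothetical rows where tokens move between columns
row_a = {"name": "Paris Hilton", "city": "Houston"}
row_b = {"name": "Hilton Houston", "city": "Paris"}
cols = ["name", "city"]

print(concatenated_key(row_a, cols) == concatenated_key(row_b, cols))  # True
print(per_column_key(row_a, cols) == per_column_key(row_b, cols))      # False
```

With concatenation the two rows collide because the key no longer knows which column a token came from. Keeping one key per column avoids that false match.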
Regarding Record Linkage, OpenRefine's support for reconciliation services should help a lot. If you want to reconcile against your own dataset, you can start by looking at a CSV-based reconciliation service: