Clustering based on several columns as conditions

dqlearner · January 22, 2023, 7:02pm

I have an excel file where I have two columns. Company Name and Country. How can I cluster the company names by country. Basically I am trying to find similar/duplicate records by country. I mean, it should check for all similarities and duplicates within the country.

Example:

Company Name -Country
-Microsoft Corp -US
-Microsoft Corp -US

-Microsoft Corp -US
-Microsoft Corp -CA

First is duplicate, should be in cluster. Second isn’t a duplicate and should not be in the cluster.

Basically I am trying to find similar/duplicate records by country. I mean, it should check for all similarities and duplicates within the country.

b2m · January 23, 2023, 7:28am

Just some follow up questions to give you a valid strategy for your data:

Roughly how many different countries do you have?
Are the company names written consistently or do you may have variations in spelling, acronyms, …
What do you want to do with your duplicates? Remove them? Mark them? Combine them? Make the spelling consistent?
Can you please add numbers or some other id to your example corporations in your question and refer to them in the text? It is ambiguous what exacly you mean by “first” and “second”.

dqlearner · January 26, 2023, 12:19am

To answer your questions:

We have a more than 140 Countries.
Yes, lot of variations in spellinigs, acronyms, exact duplicates, with comma, dots, spaces, etc.
With the duplicates, I want to export to Excel the whole Cluster. Can't be solved in OpenRefine because we have lots of transactions linked to those accounts and manual work needs to be done. The work of OpenRefine is to cluster the duplicates and potential duplicates.

Id-CompanyName -Country-Status

Microsoft Corp -US-Active
Microsoft Corp -US-Active
Microsoft Corp -US-Active
Microsoft Corp -CA-Active

So I would like to cluster the potential duplicates by country and not just by CompanyName.

thank you!!

b2m · January 26, 2023, 7:32am

I am assuming that you expect 1 to 3 to be in one cluster and 4 to be in a separate cluster.

Currently the only way OpenRefine supports direct clustering on a subset of data would be to filter the data using a filter or facet and then perform the clustering. But with ~140 countries that would take ~140 clustering steps.

So a simpler approach I am usually using is to:

cluster each column until the data is consistent per column
combine the now consistent columns in to a new column named SortBy
Move the new column SortBy to the begining
sort by SortBy
make the sort permanent
Blank down the column SortBy

I then end up having all the clustered data in a record, where I can process them together, add a record id column, ...

SortBy	Id	CompanyName	Country	Status
Microsoft Corp-CA	4	Microsoft Corp	CA	Active
Microsoft Corp-US	1	Microsoft Corp	US	Active
	2	Microsoft Corp	US	Active
	3	Microsoft Corp	US	Active

There sure are other working strategies, but this is the one I am usually going for.

dqlearner · February 5, 2023, 8:59pm

Hi, thank you so much! I followed what you said and looks like it’s working. It doesn’t mix with different countries with same CompanyName, so, so far so good!

Topic		Replies	Views
Record Linkage and Entity Resolution Development & Design	3	478	November 21, 2023
Cluster and edit function returns similar values Support and Helpdesk	3	70	June 12, 2024
Merging and dedublicating Data cleaning and transformations	6	1092	December 2, 2022
Export clustered items as seen in application Support and Helpdesk	1	220	June 1, 2023
Duplicating rows problem Support and Helpdesk	9	66	June 11, 2025

Clustering based on several columns as conditions

Related topics