Clustering based on several columns as conditions

I have an excel file where I have two columns. Company Name and Country. How can I cluster the company names by country. Basically I am trying to find similar/duplicate records by country. I mean, it should check for all similarities and duplicates within the country.

Example:

Company Name -Country
-Microsoft Corp -US
-Microsoft Corp -US

-Microsoft Corp -US
-Microsoft Corp -CA

First is duplicate, should be in cluster. Second isn’t a duplicate and should not be in the cluster.

Basically I am trying to find similar/duplicate records by country. I mean, it should check for all similarities and duplicates within the country.

Just some follow up questions to give you a valid strategy for your data:

  1. Roughly how many different countries do you have?
  2. Are the company names written consistently or do you may have variations in spelling, acronyms, …
  3. What do you want to do with your duplicates? Remove them? Mark them? Combine them? Make the spelling consistent?
  4. Can you please add numbers or some other id to your example corporations in your question and refer to them in the text? It is ambiguous what exacly you mean by “first” and “second”.

To answer your questions:

  1. We have a more than 140 Countries.
  2. Yes, lot of variations in spellinigs, acronyms, exact duplicates, with comma, dots, spaces, etc.
  3. With the duplicates, I want to export to Excel the whole Cluster. Can't be solved in OpenRefine because we have lots of transactions linked to those accounts and manual work needs to be done. The work of OpenRefine is to cluster the duplicates and potential duplicates.

Id-CompanyName -Country-Status

  1. Microsoft Corp -US-Active
  2. Microsoft Corp -US-Active
  3. Microsoft Corp -US-Active
  4. Microsoft Corp -CA-Active

So I would like to cluster the potential duplicates by country and not just by CompanyName.

thank you!!

I am assuming that you expect 1 to 3 to be in one cluster and 4 to be in a separate cluster.

Currently the only way OpenRefine supports direct clustering on a subset of data would be to filter the data using a filter or facet and then perform the clustering. But with ~140 countries that would take ~140 clustering steps.

So a simpler approach I am usually using is to:

  1. cluster each column until the data is consistent per column
  2. combine the now consistent columns in to a new column named SortBy
  3. Move the new column SortBy to the begining
  4. sort by SortBy
  5. make the sort permanent
  6. Blank down the column SortBy

I then end up having all the clustered data in a record, where I can process them together, add a record id column, ...

SortBy Id CompanyName Country Status
Microsoft Corp-CA 4 Microsoft Corp CA Active
Microsoft Corp-US 1 Microsoft Corp US Active
2 Microsoft Corp US Active
3 Microsoft Corp US Active

There sure are other working strategies, but this is the one I am usually going for.

2 Likes

Hi, thank you so much! I followed what you said and looks like it’s working. It doesn’t mix with different countries with same CompanyName, so, so far so good!