I have an excel file where I have two columns. Company Name and Country. How can I cluster the company names by country. Basically I am trying to find similar/duplicate records by country. I mean, it should check for all similarities and duplicates within the country.
Example:
Company Name -Country
-Microsoft Corp -US
-Microsoft Corp -US
-Microsoft Corp -US
-Microsoft Corp -CA
First is duplicate, should be in cluster. Second isn’t a duplicate and should not be in the cluster.
Basically I am trying to find similar/duplicate records by country. I mean, it should check for all similarities and duplicates within the country.
Just some follow up questions to give you a valid strategy for your data:
Roughly how many different countries do you have?
Are the company names written consistently or do you may have variations in spelling, acronyms, …
What do you want to do with your duplicates? Remove them? Mark them? Combine them? Make the spelling consistent?
Can you please add numbers or some other id to your example corporations in your question and refer to them in the text? It is ambiguous what exacly you mean by “first” and “second”.
Yes, lot of variations in spellinigs, acronyms, exact duplicates, with comma, dots, spaces, etc.
With the duplicates, I want to export to Excel the whole Cluster. Can't be solved in OpenRefine because we have lots of transactions linked to those accounts and manual work needs to be done. The work of OpenRefine is to cluster the duplicates and potential duplicates.
Id-CompanyName -Country-Status
Microsoft Corp -US-Active
Microsoft Corp -US-Active
Microsoft Corp -US-Active
Microsoft Corp -CA-Active
So I would like to cluster the potential duplicates by country and not just by CompanyName.
I am assuming that you expect 1 to 3 to be in one cluster and 4 to be in a separate cluster.
Currently the only way OpenRefine supports direct clustering on a subset of data would be to filter the data using a filter or facet and then perform the clustering. But with ~140 countries that would take ~140 clustering steps.
So a simpler approach I am usually using is to:
cluster each column until the data is consistent per column
combine the now consistent columns in to a new column named SortBy
Hi, thank you so much! I followed what you said and looks like it’s working. It doesn’t mix with different countries with same CompanyName, so, so far so good!