I'm working with a dataset where some of the entries in the first and last name columns are not actual names, but rather random words that shouldn't be there. I'm considering using OpenRefine to clean up these entries and was wondering if anyone has experience with this. Specifically, I'd like to know:
Can OpenRefine effectively identify and correct entries that are clearly not names?
What features or techniques within OpenRefine would be most useful for removing non-name entries and ensuring that only valid names are retained?
Any guidance or tips on how to best approach this using OpenRefine would be greatly appreciated. Thank you in advance!
You can definitely do that with OpenRefine, but it kind of depends on how much "magic" you expect.
OpenRefine provides so-called Facets to quickly get an overview of all the values in a column. In your case, you would be able to scroll through a list of all the (unique) names in your name column (sorted either alphabetically or by number of occurrences).
You can then use the Facet to either filter the dataset or directly (mass) edit the values.
Depending on how the "random" words were introduced into the dataset, you might spot a pattern that identifies clusters of them (e.g. they are all lowercase, contain numbers, have a certain length...).
In case you identify such a pattern you can combine Text filters with Facets to quickly drill down to the affected rows and then clean them.
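To illustrate what such a drill-down boils down to, here is a small plain-Python sketch (not GREL; the sample values and the two patterns are made up for illustration) that flags values containing digits or written entirely in lowercase:

```python
import re

# hypothetical column values; the patterns below are examples, not a definitive rule
values = ["Alice", "Bob", "xk42jq", "banana", "Maria"]

def looks_suspicious(v):
    # flag values that contain a digit or are entirely lowercase
    return bool(re.search(r"\d", v)) or v == v.lower()

flagged = [v for v in values if looks_suspicious(v)]
print(flagged)  # ['xk42jq', 'banana']
```

In OpenRefine you would express the same logic as a Custom Facet or Text filter instead of a loop, but the idea is the same: encode the pattern once, then review only the rows it flags.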
In my experience this works quite well for "smaller" datasets, but becomes quite tedious once you have more than 5k unique values in a column.
Depending on your dataset, it might be worth a shot to try to Reconcile the names against Wikidata (or some other data source) and then only check the rows without a reconciliation result. But this only works if the names in your dataset are likely to be found in an open data source like Wikidata.
Thank you for the suggestions on using OpenRefine. I appreciate the insight into using Facets and Text filters, which seem very useful for managing smaller datasets.
However, in my case, I'm dealing with a column that contains both first and last names, and occasionally, random words are inserted instead of proper names. Because these random entries could be any word, filtering becomes a bit more complex as they don't follow a consistent pattern. They could be uppercase, lowercase, contain numbers, or be of various lengths, making it difficult to identify them based solely on those attributes.
Given this, I might have to apply a filter to all the words in the column to ensure I can identify and manage these random entries effectively. This approach might be necessary because the random words don’t correspond to typical name entries and could be literally anything, which complicates using standard reconciliation services like Wikidata since these random words aren't likely to match any reliable external data source.
Thanks again for your help. I'll continue exploring OpenRefine's features to see if I can tailor them more closely to the unique challenges of my dataset.
One way I've approached this before is to use a Custom Facet and the cross() function to check whether a human first name/last name exists in another project's column.
But, more advanced: if you give OpenRefine enough memory (~8 GB), you can also load the entire name-dataset library and use Jython/Python to perform the lookups more directly, using Python's in or not in operators, which return True or False depending on whether a name or string exists in the dictionaries provided by name-dataset (nd.first_names and nd.last_names).
Something like:
from names_dataset import NameDataset

# instantiate once; 'value' is the current cell value in OpenRefine's Jython expression
nd = NameDataset()
return value in nd.first_names
This is a way to perform local, reconciliation-style string matching against most of the names around the world.
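To show the in / not in idea without pulling in the full library, here is a self-contained sketch where two tiny hand-built dictionaries stand in for nd.first_names and nd.last_names (the real library ships hundreds of thousands of entries):

```python
# Stand-in dictionaries for name-dataset's nd.first_names / nd.last_names
# (hypothetical sample data, just to demonstrate the membership check)
first_names = {"Alice": {}, "Bob": {}, "Maria": {}}
last_names = {"Smith": {}, "Garcia": {}}

def looks_like_name(value):
    # keep a cell only if at least one token is a known first or last name
    return any(tok in first_names or tok in last_names
               for tok in value.split())

print(looks_like_name("Alice Smith"))  # True
print(looks_like_name("banana 42"))    # False
```

Inside OpenRefine you would return the boolean directly from the Jython expression and then facet on True/False to isolate the rows that need cleaning.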
Oops, forgot to add: in order to use external libraries you need to call sys.path.append() with a folder path where import should additionally look for libraries, beyond the built-in Jython packages automatically included with OpenRefine.
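The mechanism itself is plain Python and easy to try outside OpenRefine. This sketch creates a throwaway folder with a tiny module in it (a stand-in for an installed third-party library) and shows that appending the folder to sys.path makes it importable:

```python
import os
import sys
import tempfile

# create a folder containing a tiny module, standing in for an external library
libdir = tempfile.mkdtemp()
with open(os.path.join(libdir, "mylib.py"), "w") as f:
    f.write("GREETING = 'hello'\n")

# make the interpreter (CPython or Jython) also search this folder on import
sys.path.append(libdir)
import mylib

print(mylib.GREETING)  # hello
```

In OpenRefine you would point sys.path.append() at the site-packages folder where name-dataset is installed, before the import line in your Jython expression.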