I'm working with a dataset where some of the entries in the first and last name columns are not actual names, but rather random words that shouldn't be there. I'm considering using OpenRefine to clean up these entries and was wondering if anyone has experience with this. Specifically, I'd like to know:
Can OpenRefine effectively identify and correct entries that are clearly not names?
What features or techniques within OpenRefine would be most useful for removing non-name entries and ensuring that only valid names are retained?
Any guidance or tips on how to best approach this using OpenRefine would be greatly appreciated. Thank you in advance!
You can definitely do that with OpenRefine, but it kind of depends on how much "magic" you expect.
OpenRefine provides so-called Facets to quickly get an overview of all the values in a column. In your case, you would be able to scroll through a list of all the (unique) names in your name column (sorted either alphabetically or by number of occurrences).
You can then use the Facet to either filter the dataset or directly (mass) edit the values.
Depending on how the "random" words were introduced into the dataset, you might spot a pattern that identifies clusters of them (e.g. they are all lowercase, contain numbers, have a certain length...).
In case you identify such a pattern you can combine Text filters with Facets to quickly drill down to the affected rows and then clean them.
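To illustrate what such a drill-down boils down to, here is a small plain-Python sketch (not GREL; the sample values and the two patterns are made up for illustration) that flags values containing digits or written entirely in lowercase:

```python
import re

# hypothetical column values; the patterns below are examples, not a definitive rule
values = ["Alice", "Bob", "xk42jq", "banana", "Maria"]

def looks_suspicious(v):
    # flag values that contain a digit or are entirely lowercase
    return bool(re.search(r"\d", v)) or v == v.lower()

flagged = [v for v in values if looks_suspicious(v)]
print(flagged)  # ['xk42jq', 'banana']
```

In OpenRefine you would express the same logic as a Custom Facet or Text filter instead of a loop, but the idea is the same: encode the pattern once, then review only the rows it flags.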
In my experience this works quite well for "smaller" datasets, but becomes quite tedious once you have more than 5k unique values in a column.
Depending on your dataset, it might be worth a shot to try to Reconcile the names against Wikidata (or some other data source) and then only check the rows without a reconciliation result. But this only works if the names in your dataset are likely to be found in an open data source like Wikidata.
Thank you for the suggestions on using OpenRefine. I appreciate the insight into using Facets and Text filters, which seem very useful for managing smaller datasets.
However, in my case, I'm dealing with a column that contains both first and last names, and occasionally, random words are inserted instead of proper names. Because these random entries could be any word, filtering becomes a bit more complex as they don't follow a consistent pattern. They could be uppercase, lowercase, contain numbers, or be of various lengths, making it difficult to identify them based solely on those attributes.
Given this, I might have to apply a filter to all the words in the column to ensure I can identify and manage these random entries effectively. This approach might be necessary because the random words don’t correspond to typical name entries and could be literally anything, which complicates using standard reconciliation services like Wikidata since these random words aren't likely to match any reliable external data source.
Thanks again for your help. I'll continue exploring OpenRefine's features to see if I can tailor them more closely to the unique challenges of my dataset.
One way I've approached this before is to use a Custom Facet and the cross() function to check whether a human first name/last name exists in another project's column.
But, more advanced: if you give OpenRefine enough memory (~8 GB), you can also load the entire name-dataset library and use Jython/Python to perform the lookups more directly, using Python's in or not in operators, which return True or False depending on whether a name or string exists in the dictionaries provided by name-dataset (nd.first_names and nd.last_names).
Something like:
from names_dataset import NameDataset

# instantiate once; 'value' is the current cell value in OpenRefine's Jython expression
nd = NameDataset()
return value in nd.first_names
This is a way to perform local, reconciliation-style string matching against most of the names around the world.
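To show the in / not in idea without pulling in the full library, here is a self-contained sketch where two tiny hand-built dictionaries stand in for nd.first_names and nd.last_names (the real library ships hundreds of thousands of entries):

```python
# Stand-in dictionaries for name-dataset's nd.first_names / nd.last_names
# (hypothetical sample data, just to demonstrate the membership check)
first_names = {"Alice": {}, "Bob": {}, "Maria": {}}
last_names = {"Smith": {}, "Garcia": {}}

def looks_like_name(value):
    # keep a cell only if at least one token is a known first or last name
    return any(tok in first_names or tok in last_names
               for tok in value.split())

print(looks_like_name("Alice Smith"))  # True
print(looks_like_name("banana 42"))    # False
```

Inside OpenRefine you would return the boolean directly from the Jython expression and then facet on True/False to isolate the rows that need cleaning.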
Oops, forgot to add: in order to use external libraries you need to call sys.path.append() with a folder path where import should additionally look for libraries, beyond the built-in Jython packages automatically included with OpenRefine.
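The mechanism itself is plain Python and easy to try outside OpenRefine. This sketch creates a throwaway folder with a tiny module in it (a stand-in for an installed third-party library) and shows that appending the folder to sys.path makes it importable:

```python
import os
import sys
import tempfile

# create a folder containing a tiny module, standing in for an external library
libdir = tempfile.mkdtemp()
with open(os.path.join(libdir, "mylib.py"), "w") as f:
    f.write("GREETING = 'hello'\n")

# make the interpreter (CPython or Jython) also search this folder on import
sys.path.append(libdir)
import mylib

print(mylib.GREETING)  # hello
```

In OpenRefine you would point sys.path.append() at the site-packages folder where name-dataset is installed, before the import line in your Jython expression.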