Anyone know a way to clean symbols that have replaced accents in a dataset's first names?

Sorry if this is covered on here but I couldn't see it. Does anyone know if it's possible with OpenRefine to clean and standardize first names in my marketing data? A lot of the accents on names have been replaced with symbols.

Hi @DVCEC. This usually happens when the text has been interpreted using the wrong character encoding scheme. There are a couple of things you can try:

  1. When you first import the data into OpenRefine, try changing the "Character encoding" setting. Note that this setting is not available for all import formats, but if you are importing from a text file (e.g. CSV) then you can set it there
  2. Once the data has been imported into OpenRefine, try using the GREL function reinterpret. This reinterprets a piece of text via a different character encoder. The list of supported character encoders is available at Supported Encodings

For option 2 you'd do something like:

  • Choose Edit cells -> Transform... from the column drop-down menu
  • In the GREL expression box use
    value.reinterpret("utf8")
    replacing "utf8" with the character encoder you want to use.

I've tended to find fixing these problems on import more reliable than the reinterpret approach, but both can work. The really tricky thing is knowing what character encoding was used originally, and therefore which character encoding scheme to specify either on import or in the reinterpret function.

Sometimes this is just trial and error - you can start with the more common schemes (UTF-8, ISO-8859-1, Windows-1252), but if the file was created on an obscure or old piece of software or operating system then it can be difficult to know what was used.
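One rough way to experiment (just a sketch - swap in whichever encoders from the supported list you want to test) is to use the Transform... preview, or a custom text facet, with each candidate in turn, e.g.:

    value.reinterpret("windows-1252")
    value.reinterpret("iso-8859-1")

Whichever one turns the symbols back into sensible accented characters in the preview is probably the encoding your data was originally written with.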

If you can post any more information about the file format or how the data was created, others may have further insights.

Best wishes,

Owen

Oh - and for standardisation, definitely take a look at the Clustering tool in OpenRefine - this is designed to help standardise things like people's names.
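As a bit of background (just a sketch of how it works), the default "key collision" clustering method computes a fingerprint of each value - roughly lowercased, punctuation stripped, tokens sorted - and groups values that end up with the same key. You can see the key a value would get by using the GREL fingerprint function in a transform preview or custom text facet:

    value.fingerprint()

Values which produce the same fingerprint will be offered as a single cluster by that method.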

@ostephens Thanks so much for the advice, really appreciate the help! Will give this a try later on today.

I second the recommendation to fix these as far upstream as possible - at import time or even earlier, if necessary. Some forms of corruption are irreversible (e.g. if the characters have been replaced with the Unicode replacement character).
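If you want to check whether any cells have already got to that point, one quick way (just a sketch) is a custom text facet on the column with something like:

    value.contains("�")

where � is the Unicode replacement character (U+FFFD). Any cells that come up true have already lost the original bytes, so reinterpret won't help - you'd need to go back to the source data for those.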

Tom

Thanks for the great help! The transform function has worked wonders on most of the data where the encoding is the problem, which is awesome. The standardization tool has also helped with the format of the text and the capitals at the start, which is going to save a load of time! :slight_smile:

I do have a fair bit to do manually, but the cluster tool makes it super easy to see what errors are there and change them quickly.

I've just invested in a tool to identify and rectify the data ahead of the import into my CRM. Again, thanks for the help - wish I had known about this years ago haha
