Anyone know a way to clean symbols that have replaced accents in a dataset's first names?

Sorry if this is covered on here but I couldn't see it. Does anyone know if it's possible with OpenRefine to clean and standardize first names in my marketing data? A lot of the accents on names have been replaced with symbols.

Hi @DVCEC. This usually happens when the text has been interpreted using the wrong character encoding scheme. There are a couple of things you can try:

  1. When you first import the data into OpenRefine, try changing the "Character encoding" setting. Note that this setting is not available for all import formats, but if you are importing from a text file (e.g. CSV) then you can set it there
  2. Once the data has been imported into OpenRefine, try using the GREL function reinterpret. This reinterprets a piece of text via a different character encoder. The list of supported character encoders is available at Supported Encodings

For option 2 you'd do something like:

  • Choose Edit cells -> Transform... from the column drop-down menu
  • In the GREL expression box use
    value.reinterpret("utf8")
    replacing "utf8" with the character encoder you want to use.

I've tended to find fixing these problems on import more reliable than the reinterpret approach, but both can work. The really tricky thing is knowing what character encoding was used originally, and therefore which character encoding scheme to specify either on import or in the reinterpret function.

Sometimes this is just trial and error - you can start with the more common schemes (UTF-8, ISO-8859-1, Windows-1252), but if the file was created on an obscure or old piece of software or operating system then it can be difficult to know what was used.
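One rough way to experiment (just a sketch - swap in whichever encoders from the supported list you want to test) is to use the Transform... preview, or a custom text facet, with each candidate in turn, e.g.:

    value.reinterpret("windows-1252")
    value.reinterpret("iso-8859-1")

Whichever one turns the symbols back into sensible accented characters in the preview is probably the encoding your data was originally written with.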

If you can post any more information about the file format or how the data was created, others may have further insights.

Best wishes,

Owen

Oh - and for standardisation, definitely take a look at the Clustering tool in OpenRefine - this is designed to help standardise things like people's names.
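As a bit of background (just a sketch of how it works), the default "key collision" clustering method computes a fingerprint of each value - roughly lowercased, punctuation stripped, tokens sorted - and groups values that end up with the same key. You can see the key a value would get by using the GREL fingerprint function in a transform preview or custom text facet:

    value.fingerprint()

Values which produce the same fingerprint will be offered as a single cluster by that method.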

@ostephens Thanks so much for the advice, really appreciate the help! Will give this a try later on today.

I second the recommendation to fix these as far upstream as possible - at import time or even earlier, if necessary. Some forms of corruption are irreversible (e.g. if the characters have been replaced with the Unicode replacement character).
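If you want to check whether any cells have already got to that point, one quick way (just a sketch) is a custom text facet on the column with something like:

    value.contains("�")

where � is the Unicode replacement character (U+FFFD). Any cells that come up true have already lost the original bytes, so reinterpret won't help - you'd need to go back to the source data for those.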

Tom

Thanks for the great help! The transform function has worked wonders on most of the data where the encoding is the problem, which is awesome. The standardization tool has also helped with the format of the text and the capitals at the start, which is going to save a load of time! :slight_smile:

I do have a fair bit to do manually, but the cluster tool makes it super easy to see what errors are there and change them quickly.

I've just invested in a tool to identify and rectify the data ahead of the import into my CRM. Again, thanks for the help - wish I had known about this years ago haha
