GREL replace() using regular expression

goyosoyo · June 22, 2024, 11:18pm

I need it to take the result of a regular expression and pipe it into the toUppercase function. I'm new to GREL and can't figure it out. Here's the use case.

Using Common Transformations > To titlecase to clean up mixed-case names works well except on hyphenated last names which get transformed to something like this

John-doe

I created a regex expression to find the character after the hyphen but I can't figure how how to use the GREL replace() function to transform the result of the regex into an uppercase letter. The function

value.replace(/(?<=[-])./,"X")

works as expected and transforms John-doe into John-Xoe

But when I try something like this

value.replace(/(?<=[-])./, toUppercase())

the GREL function toUppercase doesn't know what result to transform. It's expecting a string. How is this done?

thadguidry · June 23, 2024, 12:54am

Welcome Greg @goyosoyo !
I'll update our docs, because it looks like we're still missing the delimiter argument that you can pass into toTitlecase(string, delimiter). Sorry about that!

Try:

value.toTitlecase(" -")

which will capitalize letters appearing after spaces and after a hyphen.

Add any more delimiter chars as necessary, and you might be careful and want to do this in combination with a Facet that narrows or filters to only rows where your cells values contain the delimiters, such as each one in the set of -'_+ perhaps depending on what's in your cells values. Update a Text Filter or Facet and transform cells with toTitlecase() with the necessary individual delimiter. Rinse and repeat until you've covered the set of delimiters for your cells values that you want values to be converted to titlecase.

goyosoyo · June 23, 2024, 5:24am

Thanks for the warm welcome. A cleaner, simpler solution is always welcome. Thank you. Greg

DimasJusto · July 18, 2024, 8:51pm

Hello.
My situation is very similar to @goyosoyo
I've already done dozens of searches in the OpenRefine documentation, on Google, etc.
On ChatGPT I made about 15 prompts and nothing.
I deal reasonably with Regex, but in OpenRefine I can't change the case of a string.

I need to change the text
Tribunal Regional do Trabalho - Trt
to
Tribunal Regional do Trabalho - TRT

Could anyone give some direction for this?
Thanks

HannaMeiners · July 19, 2024, 6:50am

Hi,
I think it should work with:

value.replace("Tribunal Regional do Trabalho - Trt","Tribunal Regional do Trabalho - TRT")

Best!

Nico_Altink · July 19, 2024, 8:21am

I guess there are multiple cases you want to handle, ChatGPT advised me for making the last "word" in uppercase with:
value.split(" ").slice(0, -1).join(" ") + " " + value.split(" ").slice(-1)[0].toUpperCase()

ostephens · July 19, 2024, 8:55am

When you are trying to solve this type of problem you need to give some thought to what the general patterns are across your complete dataset - because the challenge is usually not changing a single value, but changing all the similar variations in your dataset.

If it is really just this one phrase repeated across your dataset then it could be done by creating a text facet (Exploring facets | OpenRefine) and doing the edit in the facet (editing a facet value changes that value for all cells in the column), or by the suggestion from @HannaMeiners

However, if you have different variations of values where the pattern is:

<full name> - <abbreviation>

and you need the to always be in uppercase, then an approach like @Nico_Altink suggests would be better. The expression Nico suggests from ChatGPT is OK, but there are two issues with it:

a minor issue: there is a typo - it needs to be toUppercase() not toUpperCase() GREL functions | OpenRefine
a bigger issue: it is a very general expression which always converts the last word to uppercase which could result in making changes where they aren't needed

Assuming the pattern is
<full name> - <abbreviation>

I'd suggest

value.split(" - ")[0]+" - "+value.split(" - ")[1].toUppercase()

or to avoid a little repetition you can use the with() control

with(value.split(" - "),a,a[0]+" - "+a[1].toUppercase())

Nico_Altink · July 19, 2024, 10:00am

Thanks for correcting ChatGPT, as indeed I forgot to test the expression.
However being a bit picky I can also discuss your " - " split, as it will fail when there's no "-" separator, thus my choice to use the last space as separator of a "word" was regarded more robuste.
I do have a similar function coded in VBA that's based on showing just the capitals in the selected text.
In OpenRefine this could be done by first splitting the column text by your " - " separator and then extract the capitals from the intended text. This does give for free an additional test for description without the "-" separator.

ostephens · July 19, 2024, 1:47pm

I agree - if your requirement is to capitalise the last word then the split function needs to match this purpose and the option you give would work here.

My point of it being general is if (for example) you had a mixture of values like

Tribunal Regional do Trabalho - Trt
Tribunal Regional do Trabalho

then in that case you don't want to capitalise the last word for the second line but do something else. This is what I meant by saying the ChatGPT solution here

is a very general expression which always converts the last word to uppercase which could result in making changes where they aren't needed

But there are many approaches and many variations you can take here. This is why I say first:

When you are trying to solve this type of problem you need to give some thought to what the general patterns are across your complete dataset - because the challenge is usually not changing a single value, but changing all the similar variations in your dataset.

Without seeing the whole data set it is not really possible to say which approach is the right one here - this is something @DimasJusto will need to work out I think

DimasJusto · July 23, 2024, 1:20pm

Hi HannaMeiners.
I ended up forgetting to write that this is just an example. I have around 1,350 occurrences with different texts.
I resolved the situation by exporting to .tsv and dealing with regex in Notepad ++.
Thank you for your contribution.

Best.

ostephens · July 23, 2024, 1:24pm

Hi @DimasJusto. Just in case you hadn’t realised you can use regular expressions in OpenRefine (in the replace and other functions)

It would be interesting to know why you decided it was easier to do outside OpenRefine in the end as we’re always interested in how we can improve OpenRefine

Topic		Replies	Views
Question about value.replace problem Data cleaning and transformations	4	365	March 26, 2024
Playing a bit with ChatGPT to make my life easier - I think it can help (?) users generate regex, GREL...? Development & Design	4	1295	March 24, 2023
How much data can OpenRefine handle? Support and Helpdesk faq	7	1013	November 8, 2023
Creating a conditional (if-then) regex Support and Helpdesk	4	566	June 12, 2023
Possible error in GREL documentation for cell reference Support and Helpdesk	6	130	June 5, 2024

GREL replace() using regular expression

Related topics