OpenRefine regex not matching expected handling of question mark in character class

Arnaud · March 26, 2024, 10:38am

Is this a bug or a feature?

I need to extract ID values from column values:
[Row 1:]SID: S278;Leading VS: DAT;
[Row 2:]SID: S88; S89;Leading VS: SC
[Row 3:];SID: ;S201, S311?, S315?; ;Leading VS: SC;

I created this in GREL:
value.find(/\bSID[:S?;,0-9\s]+/).join(";")

Result:
[Row 1:]SID: S278;
[Row 2:]SID: S88; S89;
[Row 3:];SID:

According to regexr.com, the regex should match "SID: ;S201, S311?, S315?; ;" in row 3, but it only finds "SID:".

Is the question mark inside the character class being misinterpreted? I tried some workarounds that move the question mark outside the character class (escaped with \), but with no luck.

timtom · March 26, 2024, 12:57pm

That's strange. I'm able to replicate your situation exactly and it's returning what you're expecting:

(OpenRefine v. 3.7.9 on macOS)

I'm wondering if there's maybe some weird invisible character in your string after "SID:"?
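
One way to check for such a character is to list every code point outside printable ASCII together with its position. The following is only a sketch in plain Java (the language OpenRefine runs on), not something you can paste into GREL. The class name `InvisibleCharFinder`, the method `describeNonAscii`, and the sample string with a no-break space right after "SID:" are all assumptions made for illustration.

```java
final class InvisibleCharFinder {
    // Lists every code point outside printable ASCII (0x20..0x7E)
    // with the index at which it occurs in the string.
    static String describeNonAscii(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (cp < 0x20 || cp > 0x7E) {
                if (sb.length() > 0) {
                    sb.append(", ");
                }
                sb.append(String.format("U+%04X at index %d", cp, i));
            }
            i += Character.charCount(cp); // step over surrogate pairs correctly
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hypothetical cell value: a no-break space follows "SID:"
        String cell = "SID:\u00A0;S201";
        System.out.println(describeNonAscii(cell)); // prints: U+00A0 at index 4
    }
}
```

An empty result means the string contains only printable ASCII, so the cause would have to be elsewhere.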

timtom · March 26, 2024, 1:00pm

Unless the ";" in your example is a column delimiter? In which case, the "SID" column is indeed empty because there's a ";" before "S201".

Arnaud · March 26, 2024, 2:03pm

That was the issue: there is a \u00A0 (no-break space) in that string. Thank you!

tfmorris · March 26, 2024, 6:29pm

If you change your expression to

value.find(/(?U)\bSID[:S?;,0-9\s]+/).join(";")

I think it will likely work. We discussed in the past turning the Unicode flag on by default. Perhaps it's time to revisit that discussion.
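
To illustrate why the flag matters: OpenRefine's regular expressions are Java regular expressions, where `\s` by default covers only ASCII whitespace, and `(?U)` switches the predefined classes to their Unicode definitions, which include U+00A0. Below is a minimal sketch in plain Java, assuming the no-break space sits directly after "SID:" in row 3. The class name `UnicodeFlagDemo` and the helper `firstMatch` are made up for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class UnicodeFlagDemo {
    // Returns the first match of regex in input, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Assumed content of row 3: a no-break space (U+00A0) follows "SID:"
        String row = ";SID:\u00A0;S201, S311?, S315?; ;Leading VS: SC;";

        // Default mode: \s is ASCII-only, so the match stops at the no-break space.
        System.out.println(firstMatch("\\bSID[:S?;,0-9\\s]+", row));
        // prints: SID:

        // With (?U), \s also matches U+00A0, so the whole ID list is captured
        // (the character after "SID:" in the output is the no-break space).
        System.out.println(firstMatch("(?U)\\bSID[:S?;,0-9\\s]+", row));
    }
}
```

The question mark inside the character class is a literal in both patterns; only the definition of `\s` differs between the two runs.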

Tom

Arnaud · March 27, 2024, 9:42am

It works and feels better than adding the invisible character to the regex character class. Thanks!

timtom · March 28, 2024, 2:46pm

That's indeed a better answer than mine. I think you should pick that one as the solution instead!
