OpenRefine regex not matching as expected: handling of question mark in character class

Is this a bug or a feature?

I need to extract ID values from column values:
[Row 1:]SID: S278;Leading VS: DAT;
[Row 2:]SID: S88; S89;Leading VS: SC
[Row 3:];SID: ;S201, S311?, S315?; ;Leading VS: SC;

I created this in GREL:
value.find(/\bSID[:S?;,0-9\s]+/).join(";")

Result:
[Row 1:]SID: S278;
[Row 2:]SID: S88; S89;
[Row 3:];SID:

According to regexr.com, the regex should match "SID: ;S201, S311?, S315?; ;" in row 3, but it only finds "SID:".

Is it an issue with how the question mark is interpreted inside a character class? I tried some workarounds that put the question mark outside a character class (escaped with \), but with no luck.

That's strange. I replicated your situation exactly and it returns what you're expecting (OpenRefine v. 3.7.9 on macOS).

I'm wondering if there's maybe some weird invisible character in your string after "SID:"?

Unless the ";" in your example is a column delimiter? In which case, the "SID" column is indeed empty because there's a ";" before "S201".

That was the issue: there is a \u00A0 (non-breaking space) in that string. Thank you!
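
For anyone hunting a similar invisible character, here is a minimal Python sketch (run outside OpenRefine) that lists every non-ASCII character in a cell value with its position and Unicode name. The sample string, including the position of the non-breaking space right after "SID:", is an assumption for illustration, and `find_non_ascii` is a hypothetical helper name.

```python
import unicodedata


def find_non_ascii(text):
    """Return (index, code point, Unicode name) for every non-ASCII character."""
    return [
        (i, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for i, ch in enumerate(text)
        if ord(ch) > 127
    ]


# Assumed sample: row 3 with a non-breaking space (U+00A0) right after "SID:".
row3 = ";SID:\u00a0;S201, S311?, S315?; ;Leading VS: SC;"
print(find_non_ascii(row3))
```

Pasting a suspicious cell value into `row3` should make any hidden character show up by name.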

If you change your expression to

value.find(/(?U)\bSID[:S?;,0-9\s]+/).join(";")

I think it will likely work. In the past we discussed turning the Unicode flag on by default. Perhaps it's time to reconsider that argument.
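
To illustrate why the flag helps: OpenRefine regexes run on Java's engine, where `\s` covers only ASCII whitespace unless `(?U)` is set. The Python sketch below is an analogy, not OpenRefine code. Python's `re.ASCII` flag stands in for Java's default behaviour, and Python's default Unicode matching stands in for `(?U)`. The sample string with U+00A0 right after "SID:" is an assumption.

```python
import re

# Same pattern as in the GREL expression, without the (?U) prefix.
PATTERN = r"\bSID[:S?;,0-9\s]+"

# Assumed sample: row 3 with a non-breaking space (U+00A0) right after "SID:".
row3 = ";SID:\u00a0;S201, S311?, S315?; ;Leading VS: SC;"

# ASCII-only \s: the match stops at the non-breaking space.
ascii_matches = re.findall(PATTERN, row3, flags=re.ASCII)

# Unicode-aware \s: the non-breaking space counts as whitespace.
unicode_matches = re.findall(PATTERN, row3)

print(ascii_matches)
print(unicode_matches)
```

The first list holds only "SID:", while the second holds the whole run up to the last ";" before "Leading".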

Tom

It works, and it feels better than adding the invisible character to the regex character class. Thanks!
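
For comparison, the other workaround (listing the invisible character explicitly in the character class) can be sketched in Python as below. Again this is an analogy rather than GREL: `re.ASCII` imitates Java's default ASCII-only `\s`, and the sample string is an assumption.

```python
import re

# Character class extended with an explicit non-breaking space (\u00a0).
EXPLICIT_PATTERN = r"\bSID[:S?;,0-9\s\u00a0]+"

# Assumed sample: row 3 with a non-breaking space (U+00A0) right after "SID:".
row3 = ";SID:\u00a0;S201, S311?, S315?; ;Leading VS: SC;"

# \s stays ASCII-only here; only the explicit \u00a0 lets the match continue.
explicit_matches = re.findall(EXPLICIT_PATTERN, row3, flags=re.ASCII)
print(explicit_matches)
```

This handles only the one character that was listed, so other Unicode spaces such as U+202F would still break the match, which is one reason the flag-based approach is more robust.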

That's indeed a better answer than mine. I think you should pick that one as the solution instead!