Arnaud
March 26, 2024, 10:38am
1
Is this a bug or a feature?
I need to extract ID values from column values like these:
[Row 1:]SID: S278;Leading VS: DAT;
[Row 2:]SID: S88; S89;Leading VS: SC
[Row 3:];SID: ;S201, S311?, S315?; ;Leading VS: SC;
I created this in GREL:
value.find(/\bSID[:S?;,0-9\s]+/).join(";")
Result:
[Row 1:]SID: S278;
[Row 2:]SID: S88; S89;
[Row 3:];SID:
According to regexr.com, the regex should match "SID: ;S201, S311?, S315?; ;" in row 3, but it only finds "SID:".
Is it an issue with interpreting the question mark inside a character class? I tried some workarounds that put the question mark outside the character class (escaped with \), but with no luck.
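For reference, here is a minimal Java sketch of the same pattern run against row 3 typed with ordinary ASCII spaces only. It assumes GREL's find() passes the pattern to java.util.regex (OpenRefine is written in Java); the class and method names are made up for illustration. Under that assumption, a question mark inside a character class is just a literal character and the whole ID list is matched:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QuestionMarkInClass {
    // Returns the first match of the pattern from the post, or null if there is none.
    static String firstMatch(String input) {
        Matcher m = Pattern.compile("\\bSID[:S?;,0-9\\s]+").matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Row 3 as it appears in the post, with plain ASCII spaces (an assumption).
        String row3 = ";SID: ;S201, S311?, S315?; ;Leading VS: SC;";
        System.out.println(firstMatch(row3)); // prints "SID: ;S201, S311?, S315?; ;"
    }
}
```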
timtom
March 26, 2024, 12:57pm
2
That's strange. I'm able to replicate your situation exactly and it's returning what you're expecting:
(OpenRefine v. 3.7.9 on macOS)
I'm wondering if there's maybe some weird invisible character in your string after "SID:"?
timtom
March 26, 2024, 1:00pm
3
Unless the ";" in your example is a column delimiter? In which case, the "SID" column is indeed empty because there's a ";" before "S201".
Arnaud
March 26, 2024, 2:03pm
4
That was the issue: there is a \u00A0 (no-break space) in that string. Thank you!
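One way to spot a character like this outside OpenRefine is to print everything that is not printable ASCII as a visible escape. This is only a sketch: the class and method names are hypothetical, and the sample string assumes the no-break space sits directly after "SID:", which the thread does not confirm.

```java
public class RevealInvisible {
    // Replaces each control or non-ASCII character with a visible hex escape.
    static String reveal(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (c < 0x20 || c > 0x7E) {
                sb.append(String.format("\\u%04X", (int) c));
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Assumed sample: a no-break space where a normal space was expected.
        String cell = ";SID:\u00A0;S201";
        System.out.println(reveal(cell));
    }
}
```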
1 Like
If you change your expression to
value.find(/(?U)\bSID[:S?;,0-9\s]+/).join(";")
I think it will work, because with the (?U) flag \s also matches Unicode whitespace such as \u00A0. We've discussed turning the Unicode flag on by default in the past. Perhaps it's time to revisit that discussion.
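To illustrate the difference, here is a small Java sketch comparing the pattern with and without (?U). It assumes GREL's find() behaves like java.util.regex, and that the no-break space comes right after "SID:"; the class and method names are invented for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UnicodeFlagDemo {
    // Returns the first match of regex in input, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Assumed sample: row 3 with a no-break space after "SID:".
        String row3 = ";SID:\u00A0;S201, S311?, S315?; ;Leading VS: SC;";

        // Default mode: \s covers ASCII whitespace only, so the match stops at the colon.
        System.out.println(firstMatch("\\bSID[:S?;,0-9\\s]+", row3));

        // With (?U), \s follows the Unicode White_Space property, which includes the no-break space.
        System.out.println(firstMatch("(?U)\\bSID[:S?;,0-9\\s]+", row3));
    }
}
```

The first call should return only "SID:", as in the original result for row 3, while the second returns the whole ID list up to "Leading".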
Tom
3 Likes
Arnaud
March 27, 2024, 9:42am
6
It works, and it feels better than adding the invisible character to the regex character class. Thanks!
timtom
March 28, 2024, 2:46pm
7
That's indeed a better answer than mine. I think you should mark that one as the solution instead!
1 Like