OpenRefine regex not matching as expected: handling of question mark in character class

Is this a bug or a feature?

I need to extract ID values from column values:
[Row 1:]SID: S278;Leading VS: DAT;
[Row 2:]SID: S88; S89;Leading VS: SC
[Row 3:];SID: ;S201, S311?, S315?; ;Leading VS: SC;

I created this in GREL:

[Row 1:]SID: S278;
[Row 2:]SID: S88; S89;
[Row 3:];SID:

According to, the regex should match "SID: ;S201, S311?, S315?; ;" in row 3, but it only finds "SID:".

Is it an issue with interpreting the question mark in a character class? I tried some workarounds to put the question mark outside a character class (escaped with a backslash), but with no luck.

That's strange. I'm able to replicate your situation exactly and it's returning what you're expecting:

(OpenRefine v. 3.7.9 on macOS)

I'm wondering if there's maybe some weird invisible character in your string after "SID:"?

Unless the ";" in your example is a column delimiter? In which case, the "SID" column is indeed empty because there's a ";" before "S201".
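
One way to check for an invisible character is to print the cell value with every non-printable or non-ASCII character shown as an escape. The following is only a minimal sketch in Java (the regex engine OpenRefine runs on); the class name `RevealInvisible`, the helper `reveal`, and the sample cell value are made up for illustration and are not part of OpenRefine.

```java
class RevealInvisible {
    // Replace every character outside printable ASCII (0x20..0x7E)
    // with a backslash-u escape so that it becomes visible.
    static String reveal(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < 0x20 || c > 0x7E) {
                sb.append(String.format("\\u%04X", (int) c));
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hypothetical cell value: a no-break space (U+00A0) follows "SID:".
        String cell = "SID:\u00A0;S201, S311?, S315?;";
        System.out.println(reveal(cell));
    }
}
```

An ordinary space (0x20) is left untouched, so only the characters that look like a space but are not one stand out in the output.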

That was the issue: there is a \u00A0 (no-break space) in that string. Thank you!

If you change your expression to


I think it will likely work. We discussed in the past making the Unicode flag be on by default. Perhaps it's time to reconsider that argument.
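
The suggested expression is not preserved in this copy of the thread. As a rough illustration of what a Unicode flag changes in Java's regex engine, the sketch below assumes the flag meant is `UNICODE_CHARACTER_CLASS`, written inline as `(?U)`. With it, `\s` also matches U+00A0; without it, `\s` covers only the ASCII whitespace characters. The pattern, the sample value, and the names `UnicodeFlagDemo` and `firstMatch` are hypothetical and are not the ones from the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnicodeFlagDemo {
    // Return the first match of regex in input, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Hypothetical cell value with a no-break space (U+00A0) after "SID:".
        String cell = "SID:\u00A0S201;";

        // Default mode: \s does not match U+00A0, so nothing is found.
        System.out.println(firstMatch("SID:\\s*S\\d+", cell));

        // With (?U), \s follows the Unicode White_Space property,
        // which includes U+00A0, so the ID is found.
        System.out.println(firstMatch("(?U)SID:\\s*S\\d+", cell));
    }
}
```

The alternative of listing the character explicitly, for example `[\s\u00A0]`, also works, but it has to be repeated for every other Unicode space that might show up in the data.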



It works, and it feels better than adding the invisible character to the regex character class. Thanks!

That's indeed a better answer than mine, I think you should pick that one as a solution instead!

