Odd space issue causing blank down issues

Hi all,

I ran into a problem and am wondering if anyone might know what’s going on. I’m trying to blank down a column and noticed that a series of terms didn’t match the way I expected. Even though the strings look identical, they use different space characters. So while sorting ignores the different spaces, blank down doesn’t. Here’s an example from my data in VS Code with the whitespace rendered:

As you can see in the screenshot, there’s some non-standard space being used in the instances without the dot. I have no idea how this got introduced. I can do a find and replace, but I’m just wondering if anyone knows anything about this kind of thing.

Thanks,

Joseph Anderson

I think you’ve got the analysis correct. It seems similar to an issue reported at Non-breaking space problem · Issue #4425 · OpenRefine/OpenRefine · GitHub. That issue specifically mentions non-breaking spaces being a problem, although I think there are a number of characters that could be used as whitespace here; the non-breaking space is relatively common.
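To confirm which character is actually sitting in a cell, one option outside OpenRefine is to list the code points of anything that looks like a space but isn't the ordinary U+0020. This is only a minimal Python sketch; the function name `describe_odd_spaces` and the sample string are made up for illustration and are not from the data in this thread.

```python
import unicodedata

def describe_odd_spaces(text):
    """List (index, code point, Unicode name) for space-like characters other than U+0020."""
    found = []
    for index, ch in enumerate(text):
        # Treat anything Python considers whitespace, or in the Unicode
        # "space separator" category (Zs), as space-like.
        is_space_like = ch.isspace() or unicodedata.category(ch) == "Zs"
        if is_space_like and ch != " ":
            found.append((index, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNNAMED")))
    return found

# Hypothetical value: a non-breaking space after "New", a normal space after "York".
print(describe_odd_spaces("New\u00a0York City"))  # [(3, 'U+00A0', 'NO-BREAK SPACE')]
```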

In that issue the user says they can resolve the immediate problem using value.replace(" "," "), where the first argument is a non-breaking space rather than an ordinary one, although there might be some slightly more comprehensive options like value.replace(/\s+/," ") or value.replace(/(?U)[\W]+/," ") that will catch a wider range of potential “space” characters or word separators that might be used.
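For anyone who wants to try the same two ideas outside OpenRefine, here is a rough Python equivalent. It is a sketch, not GREL: the helper names are hypothetical, and Python's `re` treats `\s` as Unicode-aware by default for `str` patterns, so it needs no `(?U)` flag the way a Java-style regex does.

```python
import re

NBSP = "\u00a0"  # non-breaking space

def replace_nbsp(value):
    """Narrow fix: swap only non-breaking spaces for ordinary spaces."""
    return value.replace(NBSP, " ")

def collapse_whitespace(value):
    """Broader fix: turn any run of Unicode whitespace into one ordinary space."""
    return re.sub(r"\s+", " ", value)

# Hypothetical values: the second also contains an em space (U+2003).
print(replace_nbsp("New\u00a0York") == "New York")
print(collapse_whitespace("New\u2003\u00a0York") == "New York")
```

The narrow version leaves other unusual spaces untouched, which may or may not be what you want depending on how the data was damaged.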

Ok, that makes more sense. The person that was working on this data set was doing clustering, and I wonder if somehow the non-breaking space was introduced by copying from the clustering interface, which seems to use non-breaking space in the link even if they’re not in the source string:

I wonder if somehow the non-breaking space was introduced by copying from the clustering interface, which seems to use non-breaking space in the link even if they’re not in the source string:

That's done to keep the strings from breaking and wrapping. That dialog is intended to be a visual display, not a data source.

although there might be some slightly more comprehensive options like value.replace(/\s+/," ") or value.replace(/(?U)[\W]+/," ") that will catch a wider range of potential “space” characters or word separators that might be used

I recommend using /(?U)\s+/ for Unicode whitespace characters. The \W character class is all non-word characters, which is going to include lots of non-whitespace word separators such as punctuation. This bigger hammer is good for things like splitting text into words, but probably too broad for simple whitespace replacement.

Tom
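The difference between the two character classes is easy to see on a small example. The following Python sketch uses a made-up sample string; note again that Python's `\s` and `\W` are Unicode-aware by default on `str`, so this illustrates the behaviour of the `(?U)` variants rather than reproducing OpenRefine itself.

```python
import re

# Hypothetical value with non-breaking spaces around the ampersand.
sample = "rock\u00a0&\u00a0roll, 1950s"

# \s+ touches only whitespace, so punctuation survives.
whitespace_only = re.sub(r"\s+", " ", sample)

# \W+ touches every run of non-word characters, so "&" and "," are lost too.
non_word = re.sub(r"\W+", " ", sample)

print(whitespace_only)  # rock & roll, 1950s
print(non_word)         # rock roll 1950s
```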

Just following up on this. I did confirm that the nbsp is being introduced through the normal clustering process. If, during clustering, you select one of the options by clicking on the link (not just copying), it replaces all the original spaces with nbsp. Should I report this as a bug?

Thanks for creating the bug report. If anyone else wants to track the issue, you can find it here: https://github.com/OpenRefine/OpenRefine/issues/5581