Odd space issue causing blank down issues

Joseph_Anderson · January 13, 2023, 4:47pm

Hi all,

I ran into a problem and am wondering if anyone might know what’s going on. I’m trying to blank down a column and noticed that a series of terms didn’t match the way I thought they would. Even though the strings look identical, they use different space characters. So while sorting ignores the different spaces, blank down doesn’t. Here’s an example from my data in VS Code with the whitespace rendered:

As you can see in the screenshot, there’s some non-standard space being used in the instances without the dot. I have no idea how this got introduced. I can do a find and replace, but I’m just wondering if anyone knows anything about this kind of thing.
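For anyone without an editor that renders whitespace, here is a minimal Python sketch of one way to see which space character a string actually contains. The helper name `describe_spaces` and the sample strings are made up for illustration; they are not from the data discussed in this thread.

```python
import unicodedata

def describe_spaces(text):
    """Return (index, code point, Unicode name) for each whitespace-like character."""
    found = []
    for i, ch in enumerate(text):
        # "Zs" is the Unicode space-separator category; it covers both
        # the ordinary space (U+0020) and the non-breaking space (U+00A0).
        if ch.isspace() or unicodedata.category(ch) == "Zs":
            found.append((i, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN")))
    return found

plain = "New York"        # hypothetical value with an ordinary space
odd = "New\u00a0York"     # same text with a non-breaking space

print(describe_spaces(plain))  # [(3, 'U+0020', 'SPACE')]
print(describe_spaces(odd))    # [(3, 'U+00A0', 'NO-BREAK SPACE')]
print(plain == odd)            # False: the strings only look identical
```

The two values compare as different even though they render the same, which is consistent with blank down treating them as distinct.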

Thanks,

Joseph Anderson

ostephens · January 13, 2023, 5:02pm

I think you’ve got the analysis correct. It seems similar to the issue reported at “Non-breaking space problem” (Issue #4425, OpenRefine/OpenRefine on GitHub). That issue specifically mentions non-breaking spaces being a problem, although I think there are a number of characters that could be used as whitespace here; the non-breaking space is just relatively common.

In that issue the user says they can resolve the immediate problem using value.replace(" "," "), where the first argument is a non-breaking space and the second is an ordinary space. There might be some slightly more comprehensive options, like value.replace(/\s+/," ") or value.replace(/(?U)[\W]+/," "), that will catch a wider range of potential “space” characters or word separators.

Joseph_Anderson · January 13, 2023, 5:26pm

Ok, that makes more sense. The person that was working on this data set was doing clustering, and I wonder if somehow the non-breaking space was introduced by copying from the clustering interface, which seems to use non-breaking space in the link even if they’re not in the source string:

tfmorris · January 13, 2023, 7:09pm

> I wonder if somehow the non-breaking space was introduced by copying from the clustering interface, which seems to use non-breaking space in the link even if they’re not in the source string:

That's done to keep the strings from breaking and wrapping. That dialog is intended to be a visual display, not a data source.

> although there might be some slightly more comprehensive options like value.replace(/\s+/," ") or value.replace(/(?U)[\W]+/," ") that will catch a wider range of potential “space” characters or word separators that might be used

I recommend using /(?U)\s+/ for Unicode whitespace characters. The \W character class matches all non-word characters, which includes lots of non-whitespace word separators such as punctuation. This bigger hammer is good for things like splitting text into words, but is probably too broad for simple whitespace replacement.
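To illustrate the difference between the three patterns, here is a rough sketch in Python rather than GREL. It is an analogue only: GREL uses Java regular expressions, where plain \s is ASCII-only and (?U) switches on Unicode character classes, whereas Python's \s is Unicode-aware by default, so re.ASCII is used below to mimic the Java default. The sample string is invented.

```python
import re

# Hypothetical value: a non-breaking space after the comma, two ordinary spaces later.
sample = "Smith,\u00a0John  (ed.)"

# ASCII-only \s, standing in for Java's plain /\s+/: the non-breaking space survives.
ascii_ws = re.sub(r"\s+", " ", sample, flags=re.ASCII)

# Unicode-aware \s, standing in for /(?U)\s+/: every whitespace run becomes one space.
unicode_ws = re.sub(r"\s+", " ", sample)

# \W+ replaces every run of non-word characters, so punctuation is lost too.
non_word = re.sub(r"\W+", " ", sample)

print(repr(ascii_ws))    # 'Smith,\xa0John (ed.)'
print(repr(unicode_ws))  # 'Smith, John (ed.)'
print(repr(non_word))    # 'Smith John ed '
```

Only the Unicode-aware whitespace pattern normalizes the non-breaking space while leaving the comma, parentheses, and period in place.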

Tom

Joseph_Anderson · January 24, 2023, 8:02pm

Just following up on this. I did confirm that the nbsp is being introduced through the normal clustering process. If, during clustering, you select one of the options by clicking on the link (not just copying), it replaces all the original spaces with nbsp. Should I report this as a bug?

tfmorris · January 25, 2023, 6:09pm

Thanks for creating the bug report. If anyone else wants to track the issue, you can find it here: https://github.com/OpenRefine/OpenRefine/issues/5581
