"split multi-valued cells" seems to affect non-matching rows

benmiller314 · July 14, 2023, 4:16pm

Hi, community! In my dataset, I have several records that include appropriately multi-valued cells, with values all delimited by " | ". For just one of those records, though, I have a reason to split it out into two single-valued records, each with their own row. I've faceted so that this row is the only one that matches, but when I do "split multi-valued cells," OpenRefine seems to be splitting values elsewhere, too.

As the attached screenshots demonstrate, I start out with the same number of records and rows. When I then split one column that has two values, the record correctly shows two rows – but the total number of rows goes up by five instead of by one.

Am I misunderstanding the scope of this operation? I had thought all operations only acted on rows matching the current filter.

Since it's only one row, I can just export to Excel and add the row manually, but it's still weird to me, and if it's a bug, I'd want to make sure the developers know about it so it doesn't mess anything up on a larger scale.

Thanks for your help!

antonin_d · July 17, 2023, 6:55pm

Hi @benmiller314, welcome to the forum and sorry we have not answered earlier.
I think this is a very interesting problem you have, and I'd be curious to look into it and check if the new architecture I am working on fixes it. It does not sound like the easiest thing to reproduce so in case you'd be able to share an export of your project (even privately) it would likely be very helpful.

tfmorris · July 17, 2023, 8:13pm

Am I misunderstanding the scope of this operation? I had thought all operations only acted on rows matching the current filter.

There are a number of operations which aren't affected by facets. Some are obvious (e.g. pure column operations like rename, reorder, etc), some slightly less so (e.g. row reorder, aka "make sort permanent") and some not obvious at all (e.g. split/join multivalued cells). This has been the behavior for over a dozen years, but it might not be well documented. Without thinking deeply about it, I'm not sure if the split/join operations were done this way simply for ease of implementation or whether there are some hidden gotchas with having them honor facets. On the surface it seems like it should be tractable with a little extra bookkeeping to ignore newly created/deleted rows when iterating.

As a workaround, you could use your facets to select the appropriate rows, transform them to replace the separator character with something unique and then do the split on the new separator character which only exists in the desired rows.

Tom
p.s. Thank you for providing data to reproduce the problem with the Github issue that you created. That always makes things easier to test.

tfmorris · July 17, 2023, 8:27pm

pps Here's the full list of operations which are NOT affected by facets:

Column - Move, Remove, Rename, Reorder
Denormalize (which isn't exposed anywhere in the UI that I can find)
MultiValuedCell - Split, Join
RowReorder (ie Sort -> Reorder rows permanently)
Transpose - Columns to Rows, Rows to Column, KeyValueColumnize

benmiller314 · July 19, 2023, 3:01pm

Thanks for these clarifications and for the workaround with an alternate separator! I should have thought of that, which has the benefit of scaling more readily (and preserving history) compared to the export/re-import I'd been worried I'd have to do.

For convenience, here's a link to the related GitHub repo, which has now been scoped toward documenting rather than changing the behavior – which, as Tom pointed out, has been the behavior for a very long time, and I'd just never had reason to notice it before.

Topic		Replies	Views
Two ways of feature extraction: 1. split multi-value cells; 2. split into multiple columns Data cleaning and transformations	6	1216	March 24, 2023
Duplicating rows problem Support and Helpdesk	9	50	June 11, 2025
Joining Multi Value Cells Not Completing Data cleaning and transformations	6	402	November 7, 2023
Convert your rows of data into multi-row records Support and Helpdesk	4	340	January 20, 2024
Multiple values in one cell during reconcile Support and Helpdesk	7	136	June 30, 2024

"split multi-valued cells" seems to affect non-matching rows

Related topics