"split multi-valued cells" seems to affect non-matching rows

Hi, community! In my dataset, I have several records that include appropriately multi-valued cells, with values all delimited by " | ". For just one of those records, though, I have a reason to split it out into two single-valued records, each with their own row. I've faceted so that this row is the only one that matches, but when I do "split multi-valued cells," OpenRefine seems to be splitting values elsewhere, too.

As the attached screenshots demonstrate, I start out with the same number of records and rows. When I then split one column that has two values, the record correctly shows two rows – but the total number of rows goes up by five instead of by one.

Am I misunderstanding the scope of this operation? I had thought all operations only acted on rows matching the current filter.

Since it's only one row, I can just export to Excel and add the row manually, but it's still weird to me, and if it's a bug, I'd want to make sure the developers know about it so it doesn't mess anything up on a larger scale.

Thanks for your help!

1 Like

Hi @benmiller314, welcome to the forum and sorry we have not answered earlier.
I think this is a very interesting problem you have, and I'd be curious to look into it and check if the new architecture I am working on fixes it. It does not sound like the easiest thing to reproduce so in case you'd be able to share an export of your project (even privately) it would likely be very helpful.

Am I misunderstanding the scope of this operation? I had thought all operations only acted on rows matching the current filter.

There are a number of operations which aren't affected by facets. Some are obvious (e.g. pure column operations like rename, reorder, etc), some slightly less so (e.g. row reorder, aka "make sort permanent") and some not obvious at all (e.g. split/join multivalued cells). This has been the behavior for over a dozen years, but it might not be well documented. Without thinking deeply about it, I'm not sure if the split/join operations were done this way simply for ease of implementation or whether there are some hidden gotchas with having them honor facets. On the surface it seems like it should be tractable with a little extra bookkeeping to ignore newly created/deleted rows when iterating.

As a workaround, you could use your facets to select the appropriate rows, transform them to replace the separator character with something unique and then do the split on the new separator character which only exists in the desired rows.

Tom
p.s. Thank you for providing data to reproduce the problem with the Github issue that you created. That always makes things easier to test.

pps Here's the full list of operations which are NOT affected by facets:

  • Column - Move, Remove, Rename, Reorder
  • Denormalize (which isn't exposed anywhere in the UI that I can find)
  • MultiValuedCell - Split, Join
  • RowReorder (ie Sort -> Reorder rows permanently)
  • Transpose - Columns to Rows, Rows to Column, KeyValueColumnize
2 Likes

Thanks for these clarifications and for the workaround with an alternate separator! I should have thought of that, which has the benefit of scaling more readily (and preserving history) compared to the export/re-import I'd been worried I'd have to do.

For convenience, here's a link to the related GitHub repo, which has now been scoped toward documenting rather than changing the behavior – which, as Tom pointed out, has been the behavior for a very long time, and I'd just never had reason to notice it before.

1 Like