Duplicating rows problem

jquartel · June 6, 2025, 11:47am

I've hit a data cleaning problem that I thought would have a simple solution in OpenRefine but I failed to find it. Probably I'm just not seeing the way, so I'm hoping someone here can point me in the right direction.
My dataset is a table (which came to me as a CSV) with one 'key' column, and then a bunch of attributes that apply to each key (with many cells left blank if the attribute was unknown). The problem is that the creator consolidated into a single row some records that have identical attributes. I've made an illustrative example here.
ORSplitTest.csv (136 Bytes)
In this table my key column is 'Name', and notice that the first row has value 'Chuck,Pat'. I want to fix the table so that this becomes two rows that have identical cell values except that one has Name:Chuck and the other has Name:Pat. I went for the obvious "Split multi-valued cells" function on the Name column, and this does create the new rows I need, but they start off blank in all but the key cell. In an attempt to copy the attribute values from the top-most row of each split set I tried "Fill down" on all columns, but since the data has many blanks cells, I don't see a way to stop the filling process from affecting other rows (i.e. rows that didn't result from the split). The only workaround I could find was to fill in all blanks, before the multi-valued cell split, with a distinctive string (like '%BLANK', say) so that the "Fill down" operation would only populate the split rows (as they're the only blanks now), then clean out the distinctive string (since I want the cells blank). But this seems a bit cumbersome, doesn't it?

Sunil_Natraj · June 6, 2025, 12:18pm

You can create an Custom text facet to filter out rows that do not need splitting. In the facet you can set the expr as value.contains(",")

ostephens · June 6, 2025, 12:26pm

There are probably a few ways of doing this, but the one that strikes me as the most obvious and efficient is:

Add an additional column at the start of the project with a record identifier. This could just be a copy of the Name column, or a new column using rowIndex+1 or similar to just number the rows
Split multil-valued cells on the Name column using the comma as the separator
Make sure you are in records mode - now the new rows you've made with split multi-valued are part of the same record, so you can do All -> Edit columns -> Fill Down to fill in the columns. In records mode the fill down should stop at the record boundary
Remove the record identifier column you added

archilecteur · June 6, 2025, 12:35pm

Why not:

select all rows containing a comma with faceting;
add a special character to all the selected row;
split these rows based on this character;
in records mode, fill down the values of the other columns;
last, split the values of the first column based on the comma.

archilecteur · June 6, 2025, 12:41pm

Sorry, meant text filter not faceting.

archilecteur · June 6, 2025, 12:43pm

@ostephens has a more elegant solution as always!

jquartel · June 6, 2025, 1:47pm

Ah, thank you all - this support space is awesome! Yes multi-value splitting (and then filling) while in records mode is the trick I was missing. @ostephens, I do still find it a little cumbersome to make that additional column - if I'm not mistaken, you have to "Add column based on this column" and then "Move column to the beginning" (unless you've just duplicated the first column in which case you can rename the new one after the split/fill). Should we consider adding a feature like "Add column..." under All > Edit Columns? This would be the same as "Add column based on this column..." except that, in the dialog, it does not start with value in the expression window (it could start with rowIndex instead), it doesn't have a "value" column in the preview window, and it creates the new column at the beginning (or alternatively, we could make the value for the "All" column implicitly the row index). Since that first column is so important for the records mode, it's maybe a bit odd that there is no quick way to make a new one.

jquartel · June 10, 2025, 7:54pm

Nice idea but it doesn't fully work. Splitting will only affect the rows that need it. With this filter I could shelter the rows that I don't want to be affected by the 'fill down' process (which, I'm guessing, is what you're suggesting this for), but this still doesn't stop the 'fill down' from affecting consecutive split rows - i.e. after the fill I would find TRUE in the DrinksTea column for Kevin and Barry.

ostephens · June 11, 2025, 10:19am

I agree. I think this would be quite easy to add - I'll try to write up a github enhancement for it

ostephens · June 11, 2025, 1:04pm

github.com/OpenRefine/OpenRefine

Support menu option to "Add ID column" to help with row->record conversion

opened 01:04PM - 11 Jun 25 UTC

ostephens

Type: Feature Request Status: Pending Review

When manipulating data in OpenRefine it is often useful to convert rows to recor…ds. When there isn't a natural "record ID" in the existing data, you have to add a new column with some unique value and move that column to be the first in the project. This takes several steps, and it would be convenient if there was a menu option that added such a column to a project without having to go through the individual steps manually. See mention in this forum thread https://forum.openrefine.org/t/duplicating-rows-problem/2360/8 ### Proposed solution A new menu item `Add ID column` could be added in `All` -> `Edit columns` which would add a new column at the start of the project (position 0) containing a unique value for each row ### Alternatives considered * Same approach but to have the item under the column drop down `Edit column` menu (i.e. repeated for each column) * Similar approach, but rather than automatically adding a unique value, have an expression editor step where the user could either accept the default value (an expression to add a system generated unique value) or edit the expression used if they have a specific unique value generation they wish to use * Rather than a specialist solution for adding an ID column at the start of the project, rather expand the `Add column based on this column` dialogue to include the position in which the new column should be added. This would be a more general solution, that would make adding a leading column easier and would allow (or force depending on how you look at it) the user to edit the the expression used to create a unique value ### Additional context https://forum.openrefine.org/t/duplicating-rows-problem/2360/8

Topic		Replies	Views
"split multi-valued cells" seems to affect non-matching rows Support and Helpdesk	4	387	July 19, 2023
Two ways of feature extraction: 1. split multi-value cells; 2. split into multiple columns Data cleaning and transformations	6	1243	March 24, 2023
Joining Multi Value Cells Not Completing Data cleaning and transformations	6	414	November 7, 2023
Convert your rows of data into multi-row records Support and Helpdesk	4	359	January 20, 2024
Merging and dedublicating Data cleaning and transformations	6	1083	December 2, 2022

Duplicating rows problem

Related topics