I've hit a data cleaning problem that I thought would have a simple solution in OpenRefine but I failed to find it. Probably I'm just not seeing the way, so I'm hoping someone here can point me in the right direction.
My dataset is a table (which came to me as a CSV) with one 'key' column, and then a bunch of attributes that apply to each key (with many cells left blank if the attribute was unknown). The problem is that the creator consolidated into a single row some records that have identical attributes. I've made an illustrative example here.
ORSplitTest.csv (136 Bytes)
In this table my key column is 'Name', and notice that the first row has value 'Chuck,Pat'. I want to fix the table so that this becomes two rows that have identical cell values except that one has Name:Chuck and the other has Name:Pat. I went for the obvious "Split multi-valued cells" function on the Name column, and this does create the new rows I need, but they start off blank in all but the key cell. In an attempt to copy the attribute values from the top-most row of each split set I tried "Fill down" on all columns, but since the data has many blanks cells, I don't see a way to stop the filling process from affecting other rows (i.e. rows that didn't result from the split). The only workaround I could find was to fill in all blanks, before the multi-valued cell split, with a distinctive string (like '%BLANK', say) so that the "Fill down" operation would only populate the split rows (as they're the only blanks now), then clean out the distinctive string (since I want the cells blank). But this seems a bit cumbersome, doesn't it?
You can create an Custom text facet to filter out rows that do not need splitting. In the facet you can set the expr as value.contains(",")
There are probably a few ways of doing this, but the one that strikes me as the most obvious and efficient is:
- Add an additional column at the start of the project with a record identifier. This could just be a copy of the Name column, or a new column using
rowIndex+1
or similar to just number the rows - Split multil-valued cells on the Name column using the comma as the separator
- Make sure you are in records mode - now the new rows you've made with split multi-valued are part of the same record, so you can do All -> Edit columns -> Fill Down to fill in the columns. In records mode the fill down should stop at the record boundary
- Remove the record identifier column you added
Why not:
- select all rows containing a comma with faceting;
- add a special character to all the selected row;
- split these rows based on this character;
- in records mode, fill down the values of the other columns;
- last, split the values of the first column based on the comma.
Sorry, meant text filter not faceting.
@ostephens has a more elegant solution as always!
Ah, thank you all - this support space is awesome! Yes multi-value splitting (and then filling) while in records mode is the trick I was missing. @ostephens, I do still find it a little cumbersome to make that additional column - if I'm not mistaken, you have to "Add column based on this column" and then "Move column to the beginning" (unless you've just duplicated the first column in which case you can rename the new one after the split/fill). Should we consider adding a feature like "Add column..." under All > Edit Columns? This would be the same as "Add column based on this column..." except that, in the dialog, it does not start with value
in the expression window (it could start with rowIndex
instead), it doesn't have a "value" column in the preview window, and it creates the new column at the beginning (or alternatively, we could make the value
for the "All" column implicitly the row index). Since that first column is so important for the records mode, it's maybe a bit odd that there is no quick way to make a new one.
Nice idea but it doesn't fully work. Splitting will only affect the rows that need it. With this filter I could shelter the rows that I don't want to be affected by the 'fill down' process (which, I'm guessing, is what you're suggesting this for), but this still doesn't stop the 'fill down' from affecting consecutive split rows - i.e. after the fill I would find TRUE in the DrinksTea column for Kevin and Barry.
I agree. I think this would be quite easy to add - I'll try to write up a github enhancement for it