Hello! I am working in version 3.9.3 and wishing for improved functionality with the "Columnizing by Key/Value Columns" option. I have a large and complex dataset, so I will simplify it for illustrative purposes. I am working with data that looks like this:
ID
NAME
Name Source
Place Fields
Place Values
1
Haymarket Theatre (theater in London, England, built in 1720)
local
place_id
1264
Haymarket Theatre (London, England)
LCNAF
place_name
London
place_type
Inhabited place
place_tgn
7011781
I want the result to look like this:
ID
NAME
Name Source
place_id
place_name
place_type
place_tgn
1
Haymarket Theatre (theater in London, England, built in 1720)
local
1264
London
Inhabited place
7011781
Haymarket Theatre (London, England)
LCNAF
However, the actual result looks like this:
ID
NAME
Name Source
place_id
place_name
place_type
place_tgn
1
Haymarket Theatre (theater in London, England, built in 1720)
local
1264
Haymarket Theatre (London, England)
LCNAF
London
Inhabited place
7011781
From what I understand, this because in the original dataset, the values in the columns other than Place Fields and Place Values in the first two rows are distinct, whereas in the second two rows they are empty, and therefore identical. Is there a way around this? Thanks so much for the help!
thank you for the explicit example, which really helps to quickly identify the problem.
What you describe is a common problem when using "Columnizing by Key/Value Columns".
Maybe we should make this more explicit in the docs Transposing | OpenRefine.
The "problem" here is, that OpenRefine is using the NAME column as ids for the transposing.
As some cells in the NAME columns are empty, they are organized with "empty" as key.
Meaning they are in a separate row and not together with their original record.
The solution for this is to use Fill down on the columns with empty cells that are not columnized (NAME and Name Source in your example).
Thanks so much for the response. This helps reduce the number of extra rows, but unfortunately, still does not resolve the core of the problem. The example data which I provided was one Record out of 856 in the project. Each of these Records corresponds to a different Venue, each of which can have any number of alternative names, As such, the two alternative names in the Record I provided, "Haymarket Theatre (theater in London, England, built in 1720)" and "Haymarket Theatre (London, England)" both apply to the same Venue, as does all of the information in the Place Fields and Place Values columns. When I fill down, it eliminates one extraneous row from each record, but the columnized data is still split amongst the two rows of the Record, as the values in the first two rows differ in the other columns of the record.. I imagine that the only way to fix this would be to somehow disregard the columns in the project that don't contains Fields or Values while columnizing, which I assume is not possible, so this might be a fruitless effort, but I thought I would ask if there was a more clever workaround than I have been able to think of, or perhaps if this is something that might change in a future release of OpenRefine. Thanks again for all of your help!
I think you can do the transpose step as currently and then on each of the columns created by the transpose use Edit cells -> Join multi-valued cells - this will bring the value in the column into the first cell of the record (the record being defined by the ID column)
This option becomes cumbersome if you have many new columns created by the transpose, as you have to do the "join" step for each column. But if it is a handful or so then this won't take long.
If this doesn't work, I have a potential alternative approach I'm working on, but its looking a bit complicated, so in the first place I'd suggest the "join multi-valued cells" approach. However if that doesn't work let me know what the issues are and I can see if my other more complicated approach might be applied
Hi!
This worked for me, thanks so much! I had been hesitant to use this approach because I had several Records in the project that had multiple Places associated with the Venue, meaning that the desired result would have more than one row of values in the key columns. However, it occurred to me that after joining multi-valued cells, I could easily split them again, and that worked great. It would still be nice to have a way to disable the use of other columns as keys when trying to columnizing in this way, but this works perfectly well for now. Thanks again for all of your help!