Could Someone Give me Advice with Complex Data Transformation in OpenRefine?

Hello there,

I am working on a project using OpenRefine and have run into a bit of a challenge with data transformation. I am trying to clean and restructure a dataset that contains multiple columns with nested and inconsistent data formats.

I have columns that need to be merged based on a specific condition. For instance; I want to concatenate the values of two columns but only if a certain condition in a third column is met. Whats the best way to achieve this in OpenRefine?

Another part of the dataset has concatenated data in a single column that needs to be split into multiple columns based on delimiters. The delimiters are inconsistent; and I need a method to handle these variations effectively.

I also have nested data structures within some cells that need to be flattened out. For example; a column contains JSON like strings that I need to expand into separate columns.

Also, I have gone through this post; https://forum.openrefine.org/t/work-plan-for-the-reproducibility-improvements-project-ccsp/ which definitely helped me out a lot.

I have tried using some of the built in transformation functions but am struggling with the complexity. If anyone could provide guidance on how to use OpenRefines features to handle these transformations or point me towards relevant tutorials.

Thanks in advance for your help and assistance.

There are multiple ways of doing this, but in this case I'd typically:

  1. design a custom text facet for the third column that checked if the condition you mention is met.
  2. use this facet to filter the rows so that only rows that match this condition are shown
  3. Apply the option Edit column -> Join columns from the column dropdown menu

Because in OpenRefine operations and transformations are only applied the rows currently shown in the project, obeying all facets/filters applied, this will achieve your goal

1 Like

It's hard to give the right advice without concrete examples of how much the data varies in terms of its use of delimiters and the rest of the data in the cells, however, as a general approach you can use the Edit cells -> Split into several columns menu option.

In this function you can use a regular expression to express the delimiters, so if (for example) you have a situation where sometimes the delimiter is ~ and sometimes | you could use the regular expression:
[~|]
in the separator option, checking the "Regular expression" checkbox

The regular expression can contain lots of variations for delimiters, but this approach will only work effectively across the whole project if all the delimiters used never appear elsewhere in the cells. I mean, if you have cell values 1|2 and 3~4 the approach I describe is fine. But if you have 5~6|7~8 in another cell, this would fail as both delimiter characters are present, but you only want to split on one of them

You can refine (ahem) this approach by using an approach by using a text filter (or custom text facet) to first remove all rows that contain the multiple delimiter characters, and then do the the split, and then target the remaining rows using a different strategy - although this can be fiddly. If you have this scenario and can share some examples (either real or made up) which illustrate the data then it might be possible to give some further ideas on tackling

Again having concrete examples might help. "JSON like" and "JSON" are different so could take different approaches. But broadly, I'd be looking to transform the cell so that I have a consistent separator for each part of the value that I want in it's own column, and then using the Edit cells -> Split into several columns option to actually divide it up

For example if I start with:

[
  {"one" :1},
  {"two" : 2}
]

Then I could do a GREL transform (Edit cells -> Transform)

value.parseJson().join("|")

which would give the result

{"one":1}|{"two":2}

Then use the Split into several columns with the appropriate separator (| in this case)