Two ways of feature extraction: 1. split multi-value cells; 2. split into multiple columns

Hi, data cleaning experts and OpenRefine enthusiasts,

I have a question about data transformations in OpenRefine that extract individual data values from composite values in a column.

Method I: Split multi-valued cells
At the row level, the transformation we can use is "Split multi-valued cells": given a separator, OpenRefine splits the composite values, creates new rows, and fills each resulting cell with one individual value.

| id  | place             |
| --- | ----------------- |
| 111 | Urbana,IL         |
| 112 | New york city, NY |

=>

| id  | place         |
| --- | ------------- |
| 111 | Urbana        |
|     | IL            |
| 112 | New york city |
|     | NY            |

After applying "Split multi-valued cells," the first two rows refer to the same record, and the last two refer to another record.

The problem with this approach is that the values in all other columns remain empty for the new rows. OpenRefine knows internally that those rows are linked together, but that linkage is lost once the data is exported from OpenRefine.
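Outside OpenRefine, the same operation can be sketched with pandas (a hypothetical reproduction of the example table above). Exploding the split values while repeating the `id` shows one way the row linkage could survive a flat export:

```python
import pandas as pd

# Hypothetical reproduction of the example table.
df = pd.DataFrame({"id": ["111", "112"],
                   "place": ["Urbana,IL", "New york city, NY"]})

# Splitting on the comma and exploding mimics "split multi-valued cells":
# each part gets its own row, but here the id repeats instead of being
# left empty, so the linkage would survive a CSV export.
out = (df.assign(place=df["place"].str.split(","))
         .explode("place")
         .assign(place=lambda d: d["place"].str.strip()))
```

This is only an analogue, not what OpenRefine itself does; in OpenRefine the new rows are left blank and the linkage lives in its records model.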

Method II: Split into multiple columns
Given a separator, OpenRefine splits the composite values and creates multiple new columns, filling each new cell with one individual value.

Data quality issues:

  • This can produce too many new columns with many missing values when some composite values are long.
  • Cell values across the new columns can be misaligned, because the composite values may have different lengths.

My initial thoughts on dealing with this problem are:

  • If using row-level splitting, create a second table holding the split values and link it to the original table with a foreign key.
  • If using column-level splitting, a machine-learning-based method could be introduced to control the splitting and matching, instead of distributing the split values into new columns purely by position.
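The first idea above can be sketched in pandas (again on the hypothetical example table): the composite column moves into a second table, keyed by `id`, and the original relationship can be recovered with a join.

```python
import pandas as pd

# Original table with a composite "place" column (hypothetical data).
records = pd.DataFrame({"id": ["111", "112"],
                        "place": ["Urbana,IL", "New york city, NY"]})

# Second, normalized table: one row per split value, linked by id.
places = (records.assign(part=records["place"].str.split(","))
                 .explode("part")[["id", "part"]]
                 .assign(part=lambda d: d["part"].str.strip()))

# Main table keeps everything except the composite column.
main = records.drop(columns=["place"])

# Joining on the foreign key recovers the original relationship.
rejoined = main.merge(places, on="id")
```

This is the standard first-normal-form move from relational design; OpenRefine itself works on a single table, which is what the question below is about.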

My questions are:

  1. Will OpenRefine consider supporting manipulation of multiple tables in the future?
  2. Will OpenRefine consider enabling data semantic type detection on data values to help split the composite values into multiple columns?

Even though people can always preprocess the data with other tools and then bring it back into OpenRefine, making OpenRefine a holistic tool for data cleaning would still be an intriguing goal, wouldn't it?

Thanks,
Lan

Hi Lan,

When you use the "Split multi-valued cells" operation, you create something that OpenRefine calls a record.

You can use this structure when exporting the data in a format that supports nesting, such as JSON or XML.

If you export the data in a tabular format like CSV and use it in programs that do not support OpenRefine's record concept, you have to somehow dumb the data down.

The operation that OpenRefine offers you for this is Fill Down, which will take a value and repeat it into all following empty rows (simplified).
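The effect of Fill Down can be sketched in pandas with `ffill` (a simplified analogue, using the example table from the question):

```python
import pandas as pd

# State of the table right after "Split multi-valued cells":
# the new rows have an empty id (hypothetical data).
df = pd.DataFrame({"id": ["111", None, "112", None],
                   "place": ["Urbana", "IL", "New york city", "NY"]})

# Fill Down, simplified: each empty cell inherits the value above it,
# so the record linkage survives a flat CSV export.
df["id"] = df["id"].ffill()
```

After this, every row carries its record's id explicitly, which is what downstream tools without a records concept need.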

Your second scenario (column splitting) is quite interesting with regard to the type of information.

So New York City is a city and NY is a state and you might want to have one column with the name of the city and one for the name of the state. For this OpenRefine offers you regular expressions for either column splitting or adding a new column based on the current column.
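A minimal sketch of that regex approach, assuming the values follow a "City, ST" pattern (the pattern itself is an assumption, not something OpenRefine provides):

```python
import re

# Assumed "City, ST" shape: a city name, a comma, a two-letter state code.
pattern = re.compile(r"^\s*(?P<city>.+?)\s*,\s*(?P<state>[A-Z]{2})\s*$")

m = pattern.match("New york city, NY")
city, state = m.group("city"), m.group("state")
```

In OpenRefine you would put an equivalent expression into "Split into several columns" (regex mode) or build the new columns with GREL's `match()`.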

Most of the time this is enough to cleanly separate most of the data with some manual adjustments for the rest.

If you have more elaborate text, it is quite easy to call an ML-enabled API for named-entity recognition (NER) from OpenRefine. See, for example, Using the OpenAI API to apply natural language queries to cells/data.

This does not answer your specific questions, as those are more along the lines of "How will OpenRefine evolve in the future?" I am just trying to put your problem description in the context of how I would tackle it in OpenRefine.

Cheers Benjamin


Will OpenRefine consider supporting multiple tables manipulation in the future?

There is interest in thinking about what that would look like, at least. @steve has been working on this topic as a researcher:

Personally I would find it really great to make progress in this area, but it feels like there is still much design work to do before we can plan on implementing anything. Your thoughts on the subject are very welcome.

For more thoughts about the limitations of the records mode, see also this thread:


Thanks so much for all the detailed suggestions, Benjamin!

  • “You can use this structure when exporting the data in a format that supports nesting, such as JSON or XML.”
  • Yes, I did try exporting with JSON. Similarly, rows are not linked; all the other column values are saved as “null” in the key-value pairs.
    But I agree that I can use Fill Down to process the data before exporting.
  • “For this OpenRefine offers you regular expressions for either column splitting or adding a new column based on the current column.”
  • One realistic problem with messy real-world data is that it won't follow a single pattern. In other words, there could be many missing values in different positions, so I doubt how robustly a regular expression can handle this situation. One potential solution could be to first define the position of each data value and patch the missing values before splitting.
    NER is definitely what we might need to invite here. Thanks so much for the reference. :wink:
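One tolerant-splitting sketch for the case above (the pattern and fallback are assumptions for illustration): values that match the expected shape are split, and anything else is kept whole rather than being distributed into the wrong column.

```python
import re

# Assumed "City, ST" pattern; anything else falls through unsplit.
pattern = re.compile(r"^\s*(.+?)\s*,\s*([A-Z]{2})\s*$")

def split_place(value):
    m = pattern.match(value)
    if m:
        return m.group(1), m.group(2)
    return value, None  # unmatched rows kept intact for manual review

rows = ["Urbana,IL", "New york city, NY", "Chicago"]
split = [split_place(v) for v in rows]
```

In OpenRefine terms, the unmatched rows (those with `None`) are exactly the subset you would isolate with a facet and fix by hand or with a second pattern.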

Cheers,
Lan

Really appreciate all the references and papers shared here!

The default export of OpenRefine does not consider the record structure. You have to use GREL, Jython or Clojure to transform the records into a nested format.
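For illustration, here is what such a transformation could look like, sketched in plain Python rather than GREL (the data is the filled-down example table from earlier in the thread): rows are grouped back into records by id and nested for a JSON export.

```python
import json
import pandas as pd

# Table after Fill Down: every row carries its record's id (hypothetical data).
df = pd.DataFrame({"id": ["111", "111", "112", "112"],
                   "place": ["Urbana", "IL", "New york city", "NY"]})

# Group rows back into records and nest the place values under each id,
# preserving the original row order.
nested = [{"id": rid, "places": grp["place"].tolist()}
          for rid, grp in df.groupby("id", sort=False)]

print(json.dumps(nested))
```

Inside OpenRefine you would express the same grouping with GREL's records mode or a Jython expression; this standalone version just shows the target structure.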

This depends on the data you are working with. In my experience you often have large chunks of data following certain rules and a rest that is not easily captured by rules.

With OpenRefine you can use Facets and Filters to isolate these chunks of data and work your way through the dataset chunk by chunk. Therefore you don't have to find one regular expression to rule them all.

But in a world with LLMs you might start expecting more automation of data cleaning software.

I never knew this! Thanks!