Normalising a spreadsheet

I have a large spreadsheet (e.g. book titles, with authors and publishers) that I wish to convert into a relational database. Apart from cleaning up the data, I need to normalise it (i.e., get it to the 3rd Normal form) so that the repeating data, like authors, publisher, etc., can be held in separate tables. Can OpenRefine be used to do this? Is there a guide, an example? Has anyone done this or is there a better tool for handling that step?

Thanks

| myfta
September 18 |

  • | - |

I have a large spreadsheet (e.g. book titles, with authors and publishers) that I wish to convert into a relational database. Apart from cleaning up the data, I need to normalise it (i.e., get it to the 3rd Normal form) so that the repeating data, like authors, publisher, etc., can be held in separate tables. Can OpenRefine be used to do this? Is there a guide, an example? Has anyone done this or is there a better tool for handling that step?

You can definitely do this, or large portions of it, in OpenRefine.

The recipes that users have contributed over the years are collected here: https://github.com/OpenRefine/OpenRefine/wiki/Recipes
but I'm not sure there's anything specific to this task.

Depending on the shape of your data and how messy it is, I'd suggest using "Cluster & edit" to standardize spelling for publishers, etc and possibly reconciling against Wikidata as a way to resolve strings to strong identifiers. (which you could potentially use as your keys). If you already have tables of authors and publishers, the cross() function might also be of use.

Tom