I have a large spreadsheet (e.g. book titles, with authors and publishers) that I wish to convert into a relational database. Apart from cleaning up the data, I need to normalise it (i.e., get it to the 3rd Normal form) so that the repeating data, like authors, publisher, etc., can be held in separate tables. Can OpenRefine be used to do this? Is there a guide, an example? Has anyone done this or is there a better tool for handling that step?
I have a large spreadsheet (e.g. book titles, with authors and publishers) that I wish to convert into a relational database. Apart from cleaning up the data, I need to normalise it (i.e., get it to the 3rd Normal form) so that the repeating data, like authors, publisher, etc., can be held in separate tables. Can OpenRefine be used to do this? Is there a guide, an example? Has anyone done this or is there a better tool for handling that step?
You can definitely do this, or large portions of it, in OpenRefine.
Depending on the shape of your data and how messy it is, I'd suggest using "Cluster & edit" to standardize spelling for publishers, etc and possibly reconciling against Wikidata as a way to resolve strings to strong identifiers. (which you could potentially use as your keys). If you already have tables of authors and publishers, the cross() function might also be of use.