Extract data from one column and move to a new column

Hi,
I need to separate a column of subjects into three different columns: topical, geographic, and form. I can supply parts of strings that identify which type of subject is appropriate.
Is that possible?
Thanks.

Hi @Nancy_Sack

Yes this is absolutely possible. Could you provide an example to help us say how you could approach this?

The dataset contains about 9000 records with metadata for dissertations and theses. In the subjects column are one or more headings for each record. Some subject strings look like this: Dwellings--Hawaii--Honolulu or like this: Pohnpei (Micronesia)--Kings and rulers. I'd like to end up with this: Dwellings and Kings and rulers in the topical column; Hawaii, Honolulu, and Pohnpei (Micronesia) in the geographical column. There are also format terms like Dictionaries or Maps, which I'd also like to split out.
TIA for your help

Thanks Nancy.

Splitting up the string is quite straight forward, the challenge is likely to be knowing which of the headings is topical, which geographic and which form - the examples you've shared show that we can't rely on a standard patter for this. There are at least a couple of options for getting the result you need, and it may depend on how many unique headings you are dealing with.

The approach I'd suggest is:

  1. Split the values into separate cells using the menu option Edit cells -> Split multi-valued cells with the separator --
  2. Now each heading is in it's own cell you can use a text Facet to see a list of all the different subject headings

This is where it gets trickier - depending on the number of unique headings you have in your facet, you could either decide to label by hand:

  1. Duplicate the column with the headings so you have a copy using Edit column -> Add column based on this column with the GREL expression of just value (this is the default) - this will create a copy of your headings

  2. From the text facet, edit each value from the heading to the heading type. Because you have a duplicate column, you can edit the values in one of the columns to be the type - by editing via the facet you make the change to all occurences of a single value in the column - so for example you'd edit each of Pohnpei (Micronesia), Hawaii and Honolulu to be geographic - then in one column you'd have the original heading, and in the other the heading type

I've been trying to work out a way you could achieve the right outcome without going through each heading by hand - I'd initially assumed you could do a lookup for each heading say via id.loc.gov to find out what type of heading it was or using OCLC Fast headings - but I can't find a straightforward way to approach this sadly. Christine Harlow did write a LCSH/LCNAF reconciliation service which would be a good approach, but this is no longer maintained - you can take a look at their documentation at GitHub - cmharlow/lc-reconcile: OpenRefine Reconciliation Service for the LCNAF and LCSH from id.loc.gov and it might be possible to get that working if you are determined and happy to dive into some technical tasks.

Others here may have better suggestions ! @thadguidry @felixlohmeier

Thanks, Owen. I'll give it a try.

Thanks, Owen. I've started to do what you described but the data are really messy. The geographic term may be in the first, second, or third segment of the heading and there are often multiple headings in a single record. Once I identify the terms in a column that are geographic, how do I move those terms to a different column (called "spatial")?

Nancy

Hi Nancy, there are probably a few different ways to tackle this but I'd suggest something like my steps 3 & 4 above:

  1. Duplicate the column with the headings so you have a copy using Edit column -> Add column based on this column with the GREL expression of just value (this is the default) - this will create a copy of your headings
  2. From the text facet, edit each value from the heading to the heading type. Because you have a duplicate column, you can edit the values in one of the columns to be the type - by editing via the facet you make the change to all occurences of a single value in the column - so for example you'd edit each of Pohnpei (Micronesia), Hawaii and Honolulu to be geographic - then in one column you'd have the original heading, and in the other the heading type

Basically it's the "make a copy" step that's key here - this gives you a column where you can change the subject heading to be the type. So for example if we start with a column called "Subject Heading" like:

Subject Heading
Dwellings--Hawaii--Honolulu

Then transform that into separate values in a single column:

Subject Heading
Dwellings
Hawaii
Honolulu

Then we can make a copy (step 3 above):

Subject Heading SH Copy
Dwellings Dwellings
Hawaii Hawaii
Honolulu Honolulu

Finally you can change the value in the SH Copy column to be the type - there are different ways of doing this, but I suggested via a Facet in Step 4 above:

Subject Heading SH Copy
Dwellings Topical
Hawaii Spatial
Honolulu Spatial

Hope that makes sense?

Owen,
Thanks again for your assistance. What you outline makes perfect sense. My goal is to have one or more separate columns in relevant records for spatial, topical, temporal, and form types. How do I isolate the terms identified with each type and then move them to separate columns?

Nancy

To tedious. We should make an extension for Subject Headings that understands the common separators used “-“ and “—“ and “,” and that will then pivot and create record rows. There’s lots of little macros or generalized pipeline steps in the Library world that someone should build extensions for. (or extension pack)

Once you've got two columns, one with the headings, and one with the type like:

Subject Heading Heading type
Dwellings Topical
Hawaii Spatial
Honolulu Spatial

Then you can use a facet on the "Heading type" column and choose one value (say Spatial). This will filter your rows to only include those that have the Spatial headings (you will need to check you are in "Rows" not "Record" view for this to work correctly).

Then from the "Subject Heading" column you can do "Add column based on this column" again - this will create a new column which only contains Spatial headings. You can do this for each type of subject heading you have (Topical, Format etc).

Finally switch back to Records mode to group rows back into their original records. At that point I think you'll have what you need. You may need to do an operation of "Join multi-valued cells" on and then "Split multi-valued cells" on each of the new columns to get rid of blank spaces etc. - but that is just cosmetic tweaking at the end I think

Owen

That's very helpful. Thanks, Owen.
Nancy