Combine Specific Parts of Two Spreadsheets

Hello,

I would appreciate some help trying to combine parts of two spreadsheets that contain many different values, but one standard value that could be used as a key. I will provide two links. One to the spreadsheet that contains the current metadata:

And to another that contains the “keys” which is where I want the values from the previous sheet to me moved to:

As you can see. The “key” is that the “Identifier ID” columns from both sheets match up in certain places, and I would like for the contents of the “Test Sheet-Metadata” sheet to be moved to the appropriate columns on the “IDs” sheet. I cannot simply copy and paste because I must confirm that the values of the two “Identifier ID” columns line up before doing so, and I might as well copy and paste individual rows in that case. I am hoping to be able to take care of this without going through the data line by line.

Thank you for your help with this!

Hi @trevorstratton1

Yes this is possible using the OpenRefine cross() function. I’ll describe how to do this below, and I’m always happy to see people using OpenRefine - but I think in this case you could also probably do this work using the “VLOOKUP” function in Excel or Google Sheets.

The cross() function in OpenRefine works by looking up values from one project in another project and makes the full data from any matching row available to you. Probably the trickiest part of the process is getting the data you retrieve from the cross() function to layout exactly as you want it in your project

To get started import both spreadsheets into OpenRefine in separate projects. I’ll assume these are called “Identifiers” and “Metadata”

To take an example of filling out the Title column in the “Identifiers” project - in the “Identifiers” project, from the “Title” column you can choose the menu option Edit cellsTransform... and in the Expression use the cross() function. The cross() function will look something like:

cells["Identifier (ID)"].cross("Metadata","Identifier (ID)")

This will do a lookup of the identifier from each cell in the “Identifier (ID)” column and and find any rows in the “Metadata” project that have the same value in the “Identifier (ID)” in that project. However by itself this isn’t quite enough. What you get back from this function is an array (list) of matching rows. As long as each Identifier only appears in the Metadata project once, you should only get back a single row in each case, but you may want to check that is the case (i.e. double check no duplicate IDs have been included in the Metadata project). Looking at the test data you’ve supplied I can see the Identifiers “00086_04_01_01_09_001_002-002” appears four times in the “Test Sheet-Metadata” file - so that’s something you may want to resolve before proceeding.

So you need to use an expression like:

cells["Identifier (ID)"].cross("Metadata","Identifier (ID)")[0].cells["Title"].value

This does the cross (as described above) and then takes only the first row returned (this is what [0] does) and then cells["Title"].value grabs the value from the Title column in the row you’ve matched in the “Metadata” project.

Using the method above you’d need to do the same process for each column you want to populate (Title, Creator, Associated People etc. etc.). But because of the way you’ve laid out the “IDs” sheet with all the columns you need to populate already there, there is a shortcut.

Instead of doing the columns one by one you can do the following:

  1. In the “Identifiers” project click the “All” column dropdown menu (first column in the project always) and select “Transform”
  2. In the expression box enter cells["Identifier (ID)"].cross("Metadata","Identifier (ID)")[0].cells[columnName].value
  3. Click “OK”
  4. In the “Select columns” dialog, uncheck columns called “Variable A” ,“Variable …” etc. and the “Identifier (ID)” column - so only the metadata columns (Title, Creator etc.) are selected
  5. Click OK

This will fill out all the columns selected in the “Select columns” dialog based on the equivalent column in the Metatdata project

Hope this helps

Owen

If you want to just match data in 2 speadsheets, no need for OpenRefine.

On Google Sheet, look at the QUERY() function. It is mighty powerful. Way more than our cool cross() function.

Regards,
Antoine

Thank you for these suggestions. I tried what you did with OpenRefine and what you suggested with VLOOKUP. Both work well and I’m back in business. Thanks!

Got it. Thank you for letting me know!

1 Like