Merging and dedublicating

Hi everyone!

I am currently working on my PhD research and I would be very grateful if you could help me to figure out how to proceed with my case.

I have a spreadsheet on assembled publications records from 3 databases - Scopus, WoS and Dimensions. Obviously, it contains duplication records as some publications appear in all of three databases. But also there are exclusive publications, which are indexed by only one database. What I need is to get overloop of publications coverage between these 3 databases based on DOI data. Therefore, I have been wondering if OpenRefine can help me to do so.

My dataset looks like this.

Screenshot 2022-11-30 173255

Is it possible to delete duplicates from DOI column and marge information in “Source” cells to see if the publication was indexed by different databases. In my imagination ideally it would look like this.

Screenshot 2022-11-30 172304

Thank you!

Hi Natalia and welcome - yes this is definitely an issue that OpenRefine can help with.

OpenRefine has a mechanism to group rows together in ‘records’ (see Exploring data | OpenRefine). Once you have the rows grouped into records then you can combine the information in that record/group of rows together.

What you will need to do in this case is:

  1. Make sure DOI is the first column in your project
  2. Apply a Sort to the DOI column
  3. Once the Sort has been applied you need to choose the option to ‘Reorder rows permanently’ (see (Sort and view | OpenRefine)
  4. You should now have a project where any duplicate DOIs appear one after another in the DOI column
  5. Now in the DOI column choose the option Edit cells -> Blank down from the column dropdown menu
  6. If it doesn’t happen automatically, switch to “Show as Records” (this is at the top left of the data grid)

The steps 1-6 above will group together all the rows with the same DOI into a single Record.
Now you can work to join together the information in the source column:

  1. In the Source column choose Edit cells -> Join multi-valued cells
  2. Specify a separator character that will be used - to get the outcome you illustrate above you could just use a space character. However I’d generally recommend something a little less common (just in case any of your source names contain a space) - e.g. a pipe | or tilde ~ - but it’s completely up to you

This will join together all the values that are originally spread across multiple rows in a record into a single cell joined together by the separator character you specify.

Don’t be put off by the number of steps above - this is a really common operation in OpenRefine and it should work pretty smoothly!

Best wishes and let me know if there is anything that needs clarification or you encounter any problems.

Owen

1 Like

Dear Owen!

thank you so much for the detailed instruction. I will try to do it now

Dear Owen,

Thank you again for the useful explanation. Here is how the result looks like.

For some reason, joining data source values in a single cell did not work. I must made a mistake somewhere. Could it be because DOI column and Sourcecolumn are not next to each other?

Thank you so much!

Nataliia

The joining of multivalued cells only works if you have a record structure. For this you need to perform the steps 5 and 6 Owen described.

Sometimes the blank down has no effect when you are already in record mode or have a non permanent sort.

2 Likes

As @b2m says it looks like the “Blank down” process in step 5 hasn’t worked as expected. Once you’ve done the blank down I’d expect your project to look something like:

Note that I was in “Rows” mode when I did the “Blank down” operation, and now I’ve switched to Records mode - which you can see has grouped together the rows that had the identical DOIs - and those duplicate DOIs no longer display.

Running the “Join multi-valued cells” on the Source column at this point works as exepected for me

Please do share some screenshots from data in OpenRefine at different stages of the process if you are still having issues as it may help show what’s going wrong

3 Likes

Dear Benjamin,
Dear Owen,

thank you so much for the explanations. I indeed mixed up with row\record mode. Now everything works well.

That is amazing that such helpful community exists!

Regards,
Nataliia

3 Likes