I am currently working on my PhD research and I would be very grateful if you could help me to figure out how to proceed with my case.
I have a spreadsheet on assembled publications records from 3 databases - Scopus, WoS and Dimensions. Obviously, it contains duplication records as some publications appear in all of three databases. But also there are exclusive publications, which are indexed by only one database. What I need is to get overloop of publications coverage between these 3 databases based on DOI data. Therefore, I have been wondering if OpenRefine can help me to do so.
My dataset looks like this.
Is it possible to delete duplicates from DOI column and marge information in âSourceâ cells to see if the publication was indexed by different databases. In my imagination ideally it would look like this.
Hi Natalia and welcome - yes this is definitely an issue that OpenRefine can help with.
OpenRefine has a mechanism to group rows together in ârecordsâ (see Exploring data | OpenRefine). Once you have the rows grouped into records then you can combine the information in that record/group of rows together.
What you will need to do in this case is:
Make sure DOI is the first column in your project
Apply a Sort to the DOI column
Once the Sort has been applied you need to choose the option to âReorder rows permanentlyâ (see (Sort and view | OpenRefine)
You should now have a project where any duplicate DOIs appear one after another in the DOI column
Now in the DOI column choose the option Edit cells -> Blank down from the column dropdown menu
If it doesnât happen automatically, switch to âShow as Recordsâ (this is at the top left of the data grid)
The steps 1-6 above will group together all the rows with the same DOI into a single Record.
Now you can work to join together the information in the source column:
In the Source column choose Edit cells -> Join multi-valued cells
Specify a separator character that will be used - to get the outcome you illustrate above you could just use a space character. However Iâd generally recommend something a little less common (just in case any of your source names contain a space) - e.g. a pipe | or tilde ~ - but itâs completely up to you
This will join together all the values that are originally spread across multiple rows in a record into a single cell joined together by the separator character you specify.
Donât be put off by the number of steps above - this is a really common operation in OpenRefine and it should work pretty smoothly!
Best wishes and let me know if there is anything that needs clarification or you encounter any problems.
For some reason, joining data source values in a single cell did not work. I must made a mistake somewhere. Could it be because DOI column and Sourcecolumn are not next to each other?
As @b2m says it looks like the âBlank downâ process in step 5 hasnât worked as expected. Once youâve done the blank down Iâd expect your project to look something like:
Note that I was in âRowsâ mode when I did the âBlank downâ operation, and now Iâve switched to Records mode - which you can see has grouped together the rows that had the identical DOIs - and those duplicate DOIs no longer display.
Running the âJoin multi-valued cellsâ on the Source column at this point works as exepected for me
Please do share some screenshots from data in OpenRefine at different stages of the process if you are still having issues as it may help show whatâs going wrong