I have a spreadsheet that I want to use to add information to Wikidata. Some of the information is already on Wikidata. Here is an example: The item for Launi K. Anderson is already reconciled to the main name column I have in OpenRefine. I have a separate author-id column I cobbled together from first and last names to be as close to the Mormon Literature and Creative Arts Database ID as possible (I used it to help with the matching, since we previously had manually matched a few hundred items with it using mixānāmatch). Since the authorid property doesnāt match up to other Q#s on Wikipedia, I assume it is not reconcilable. But I wanted to somehow check to see if the Wikidata item for Mormon Literature and Creative Arts Database artist ID already exists for the items that I upload information for (I would rather not overwrite information or accidentally add a āsecondā ID). Is it possible in OpenRefine to avoid uploading information on a specific property if it already exists in the Wikidata item?
There are a couple of ways you can approach this - itās a while since I did this, so I may be forgetting some of the detail
In the Wikidata schema editor for each property you add to the schema there is a āConfgureā option which allows you to specify āEditing modeā for the property between:
- Add or merge
- Add
- Delete
And additionally you can set the āMatching strategyā
- Property, value and qualifiers
- Property and value
- Property only
So I think you could set the matching strategy to āProperty and valueā and the editing mode to āAdd or mergeā.
However, if this doesnāt behave as you need or you arenāt confident, you could instead use the option in OpenRefine to āAdd columns from reconciled valuesā to bring the relevant property down from Wikidata and then filter out all the rows that already have that property before doing the āUpload edits to wikidataā step (as only edits for rows in the current filter will be uploaded)
Hope this helps
Owen
You could use the Main Name column that you have reconciled to create a new column based on downloading the Mormon Literature and Creative Arts Database artist ID. Then select only those that lack that info to update.
If you do not want to add any additional statement on the item if it has already got one for that property, then I would recommend to use the combination āAddā / āProperty onlyā in the configuration dialog Owen mentioned. But indeed you can also do it manually by fetching existing values and filtering.
It turns out my reverse-engineered author-id column isnāt as accurate as Iād hoped. Sorry for the basic question, but how do I download data from reconciled items from Wikidata? I can at least compare it to the actual items.
You can use the āAdd columns from reconciled valuesā menu option (in the column menu ā Edit columns) to add the relevant information you want from Wikidata, then you can either keep it in OpenRefine, or you can export using the usual OpenRefine export options
Thank you!! Iām able to see that most of the artist IDs I cobbled together from first and last names match the actual IDs from matched items on Wikidata. I have another follow-up question. The artist IDs are important because they allow me to form the URL for the authorās page in the database. I need this URL to be a source for information like birth year and birth place. Some of the URLs are malformed. Is there a way to automatically check and see which URLs give me a 404 error (Iām thinking probably not, but it canāt hurt to ask)?
You can - if you use "add column by fetching URLsā and check the option to āstore errorsā (see Column editing | OpenRefine) then any 404s will appear as errors in your added column, and any successful fetches should contain the HTML (or whatever format the pages are in) from the website
one month later and I am still trying to figure out how to finish cleaning my data. I have two things that I want to do, and Iām not sure how to do it. The way Iāve imagined doing them is with an āifā function, and I donāt know how that would transform into openrefine.
- some of the URLs didnāt fetch anything. I want to say āif there is a URL-derived-from-name and nothing in its URL test column, delete that cellās URLā (I was able to delete URLs with the 404 errors by faceting the URL test column by text and then deleting the matching ones. But I canāt do that for this instance).
- similarly, I have URLs from a second source, reconciled-URL. I want to merge it with the URL-derived-from-name, and prefer the reconciled-URL. To say this in a scripting language, I imagine Iād say something like, if there is a reconciled-URL, go with that, else take the derived-from-name, else blank. Is this something I could use the existing āJythonā library for?
There are several ways you could approach this but the simplest might be to use Facets.
Starting with the first case:
If you do a āfacet by blankā (see Exploring facets | OpenRefine) on the URL test column, that will give you all the rows where that column isnāt populated.
Iām not completely sure of the project structure, but similarly you could do a facet on the āURL-derived-from-nameā column
- if the options are either URL or blank then you can just use āfacet by blankā again
- if there is a more nuanced test (e.g. it can contain a URL or contain something else) you can do a ācustom text facetā (Exploring facets | OpenRefine) and in the expression write GREL like (just as an example):
value.startsWith("http")
Once you have the two checks you need in a facet, you just select the right values in both facets, and the resulting rows will fulfil both conditions (e.g. one blank, one with a URL)
For the second case:
I have URLs from a second source, reconciled-URL. I want to merge it with the URL-derived-from-name, and prefer the reconciled-URL. To say this in a scripting language, I imagine Iād say something like, if there is a reconciled-URL, go with that, else take the derived-from-name, else blank.
Then there are lots of options but assuming both these URLs are already in your project (I'm not clear?) you could do:
- Add new column based on your 'reconciled-URL' column, just to copy that into a new column
- Use Facet by blank on this new column to find the rows where there is no value
- Make sure only the rows with the blank cells are showing in the data grid and then do a Transformation on this column with GREL like
cells["url-derived-from-name-column-name"].value
The last step will put in the value from your derived-from-name url column into the blank cells of your new column. And any remaining cells will still be blank
You can definitely achieve the same outcome with a GREL expression using conditional statements like if
and with Jython as well - but I'd say the use of facets here is probably simpler.
I'm very happy to share some illustrations or examples or answer questions if you have them
Although OpenRefine has an "if()" control, I usually recommend the use of facets as an easier to understand and debug option.
| Rachel_Helps
January 25 |
- | - |
one month later and I am still trying to figure out how to finish cleaning my data. I have two things that I want to do, and Iām not sure how to do it. The way Iāve imagined doing them is with an āifā function, and I donāt know how that would transform into openrefine.
- some of the URLs didnāt fetch anything. I want to say āif there is a URL-derived-from-name and nothing in its URL test column, delete that cellās URLā (I was able to delete URLs with the 404 errors by faceting the URL test column by text and then deleting the matching ones. But I canāt do that for this instance).
I would use "Facet by blank" (in the Customized facets menu) on the URL test column and then delete the contents of the URL-derived-from-name cells.
- similarly, I have URLs from a second source, reconciled-URL. I want to merge it with the URL-derived-from-name, and prefer the reconciled-URL. To say this in a scripting language, I imagine Iād say something like, if there is a reconciled-URL, go with that, else take the derived-from-name, else blank. Is this something I could use the existing āJythonā library for?
Make a copy of the reconciled-URL column (just in case), set up Facet by blank facets on both reconciled-URL-copy (=true) and derived-derived-from-name (=false) and copy the cell contents for all rows that match your filter.
Hope that helps!
Tom
thank you both! I was able to figure out a way to facet by blank to fill in the URLs I wanted in the correct column.