Avoid overwriting other properties after main item is reconciled to Q#?

Rachel_Helps · December 6, 2022, 10:40pm

I have a spreadsheet that I want to use to add information to Wikidata. Some of the information is already on Wikidata. Here is an example: The item for Launi K. Anderson is already reconciled to the main name column I have in OpenRefine. I have a separate author-id column I cobbled together from first and last names to be as close to the Mormon Literature and Creative Arts Database ID as possible (I used it to help with the matching, since we previously had manually matched a few hundred items with it using mix’n’match). Since the authorid property doesn’t match up to other Q#s on Wikipedia, I assume it is not reconcilable. But I wanted to somehow check to see if the Wikidata item for Mormon Literature and Creative Arts Database artist ID already exists for the items that I upload information for (I would rather not overwrite information or accidentally add a “second” ID). Is it possible in OpenRefine to avoid uploading information on a specific property if it already exists in the Wikidata item?

ostephens · December 7, 2022, 9:58am

Hi @Rachel_Helps

There are a couple of ways you can approach this - it’s a while since I did this, so I may be forgetting some of the detail

In the Wikidata schema editor for each property you add to the schema there is a “Confgure” option which allows you to specify “Editing mode” for the property between:

Add or merge
Add
Delete

And additionally you can set the “Matching strategy”

Property, value and qualifiers
Property and value
Property only

So I think you could set the matching strategy to “Property and value” and the editing mode to “Add or merge”.

However, if this doesn’t behave as you need or you aren’t confident, you could instead use the option in OpenRefine to “Add columns from reconciled values” to bring the relevant property down from Wikidata and then filter out all the rows that already have that property before doing the “Upload edits to wikidata” step (as only edits for rows in the current filter will be uploaded)

Hope this helps

Owen

dbigwood · December 7, 2022, 5:22pm

You could use the Main Name column that you have reconciled to create a new column based on downloading the Mormon Literature and Creative Arts Database artist ID. Then select only those that lack that info to update.

antonin_d · December 7, 2022, 6:46pm

If you do not want to add any additional statement on the item if it has already got one for that property, then I would recommend to use the combination “Add” / “Property only” in the configuration dialog Owen mentioned. But indeed you can also do it manually by fetching existing values and filtering.

Rachel_Helps · December 14, 2022, 6:38pm

It turns out my reverse-engineered author-id column isn’t as accurate as I’d hoped. Sorry for the basic question, but how do I download data from reconciled items from Wikidata? I can at least compare it to the actual items.

ostephens · December 14, 2022, 10:27pm

You can use the “Add columns from reconciled values” menu option (in the column menu → Edit columns) to add the relevant information you want from Wikidata, then you can either keep it in OpenRefine, or you can export using the usual OpenRefine export options

Rachel_Helps · December 14, 2022, 10:46pm

Thank you!! I’m able to see that most of the artist IDs I cobbled together from first and last names match the actual IDs from matched items on Wikidata. I have another follow-up question. The artist IDs are important because they allow me to form the URL for the author’s page in the database. I need this URL to be a source for information like birth year and birth place. Some of the URLs are malformed. Is there a way to automatically check and see which URLs give me a 404 error (I’m thinking probably not, but it can’t hurt to ask)?

ostephens · December 14, 2022, 11:17pm

You can - if you use "add column by fetching URLs’ and check the option to ‘store errors’ (see Column editing | OpenRefine) then any 404s will appear as errors in your added column, and any successful fetches should contain the HTML (or whatever format the pages are in) from the website

Rachel_Helps · January 25, 2023, 6:37pm

one month later and I am still trying to figure out how to finish cleaning my data. I have two things that I want to do, and I’m not sure how to do it. The way I’ve imagined doing them is with an “if” function, and I don’t know how that would transform into openrefine.

some of the URLs didn’t fetch anything. I want to say “if there is a URL-derived-from-name and nothing in its URL test column, delete that cell’s URL” (I was able to delete URLs with the 404 errors by faceting the URL test column by text and then deleting the matching ones. But I can’t do that for this instance).
similarly, I have URLs from a second source, reconciled-URL. I want to merge it with the URL-derived-from-name, and prefer the reconciled-URL. To say this in a scripting language, I imagine I’d say something like, if there is a reconciled-URL, go with that, else take the derived-from-name, else blank. Is this something I could use the existing “Jython” library for?

ostephens · January 25, 2023, 7:02pm

There are several ways you could approach this but the simplest might be to use Facets.

Starting with the first case:

If you do a ‘facet by blank’ (see Exploring facets | OpenRefine) on the URL test column, that will give you all the rows where that column isn’t populated.

I’m not completely sure of the project structure, but similarly you could do a facet on the “URL-derived-from-name” column

if the options are either URL or blank then you can just use ‘facet by blank’ again
if there is a more nuanced test (e.g. it can contain a URL or contain something else) you can do a ‘custom text facet’ (Exploring facets | OpenRefine) and in the expression write GREL like (just as an example): value.startsWith("http")

Once you have the two checks you need in a facet, you just select the right values in both facets, and the resulting rows will fulfil both conditions (e.g. one blank, one with a URL)

ostephens · January 25, 2023, 7:10pm

For the second case:

I have URLs from a second source, reconciled-URL. I want to merge it with the URL-derived-from-name, and prefer the reconciled-URL. To say this in a scripting language, I imagine I’d say something like, if there is a reconciled-URL, go with that, else take the derived-from-name, else blank.

Then there are lots of options but assuming both these URLs are already in your project (I'm not clear?) you could do:

Add new column based on your 'reconciled-URL' column, just to copy that into a new column
Use Facet by blank on this new column to find the rows where there is no value
Make sure only the rows with the blank cells are showing in the data grid and then do a Transformation on this column with GREL like cells["url-derived-from-name-column-name"].value

The last step will put in the value from your derived-from-name url column into the blank cells of your new column. And any remaining cells will still be blank

You can definitely achieve the same outcome with a GREL expression using conditional statements like if and with Jython as well - but I'd say the use of facets here is probably simpler.

I'm very happy to share some illustrations or examples or answer questions if you have them

tfmorris · January 25, 2023, 7:14pm

Although OpenRefine has an "if()" control, I usually recommend the use of facets as an easier to understand and debug option.

| Rachel_Helps
January 25 |

| - |

one month later and I am still trying to figure out how to finish cleaning my data. I have two things that I want to do, and I’m not sure how to do it. The way I’ve imagined doing them is with an “if” function, and I don’t know how that would transform into openrefine.

some of the URLs didn’t fetch anything. I want to say “if there is a URL-derived-from-name and nothing in its URL test column, delete that cell’s URL” (I was able to delete URLs with the 404 errors by faceting the URL test column by text and then deleting the matching ones. But I can’t do that for this instance).

I would use "Facet by blank" (in the Customized facets menu) on the URL test column and then delete the contents of the URL-derived-from-name cells.

similarly, I have URLs from a second source, reconciled-URL. I want to merge it with the URL-derived-from-name, and prefer the reconciled-URL. To say this in a scripting language, I imagine I’d say something like, if there is a reconciled-URL, go with that, else take the derived-from-name, else blank. Is this something I could use the existing “Jython” library for?

Make a copy of the reconciled-URL column (just in case), set up Facet by blank facets on both reconciled-URL-copy (=true) and derived-derived-from-name (=false) and copy the cell contents for all rows that match your filter.

Hope that helps!

Tom

Rachel_Helps · February 3, 2023, 6:21pm

thank you both! I was able to figure out a way to facet by blank to fill in the URLs I wanted in the correct column.

Topic		Replies	Views
Resolve using https://hub.toolforge.org/ Data cleaning and transformations hints-and-tips	11	461	February 12, 2023
Partial upload of a dataset? Support and Helpdesk reconciliation	4	259	November 30, 2022
Adding additional statements to Wikidata property Support and Helpdesk wikidata	22	914	October 10, 2023
Issue augmenting Wikidata QIDs in OR Support and Helpdesk	12	637	January 6, 2023
Issue with OpenRefine schema Support and Helpdesk wikidata	3	252	August 2, 2023

Avoid overwriting other properties after main item is reconciled to Q#?

Related topics