Avoid overwriting other properties after main item is reconciled to Q#?

I have a spreadsheet that I want to use to add information to Wikidata. Some of the information is already on Wikidata. Here is an example: The item for Launi K. Anderson is already reconciled to the main name column I have in OpenRefine. I have a separate author-id column I cobbled together from first and last names to be as close to the Mormon Literature and Creative Arts Database ID as possible (I used it to help with the matching, since we previously had manually matched a few hundred items with it using mixā€™nā€™match). Since the authorid property doesnā€™t match up to other Q#s on Wikipedia, I assume it is not reconcilable. But I wanted to somehow check to see if the Wikidata item for Mormon Literature and Creative Arts Database artist ID already exists for the items that I upload information for (I would rather not overwrite information or accidentally add a ā€œsecondā€ ID). Is it possible in OpenRefine to avoid uploading information on a specific property if it already exists in the Wikidata item?

Hi @Rachel_Helps

There are a couple of ways you can approach this - itā€™s a while since I did this, so I may be forgetting some of the detail

In the Wikidata schema editor for each property you add to the schema there is a ā€œConfgureā€ option which allows you to specify ā€œEditing modeā€ for the property between:

  • Add or merge
  • Add
  • Delete

And additionally you can set the ā€œMatching strategyā€

  • Property, value and qualifiers
  • Property and value
  • Property only

So I think you could set the matching strategy to ā€œProperty and valueā€ and the editing mode to ā€œAdd or mergeā€.

However, if this doesnā€™t behave as you need or you arenā€™t confident, you could instead use the option in OpenRefine to ā€œAdd columns from reconciled valuesā€ to bring the relevant property down from Wikidata and then filter out all the rows that already have that property before doing the ā€œUpload edits to wikidataā€ step (as only edits for rows in the current filter will be uploaded)

Hope this helps

Owen

2 Likes

You could use the Main Name column that you have reconciled to create a new column based on downloading the Mormon Literature and Creative Arts Database artist ID. Then select only those that lack that info to update.

2 Likes

If you do not want to add any additional statement on the item if it has already got one for that property, then I would recommend to use the combination ā€œAddā€ / ā€œProperty onlyā€ in the configuration dialog Owen mentioned. But indeed you can also do it manually by fetching existing values and filtering.

1 Like

It turns out my reverse-engineered author-id column isnā€™t as accurate as Iā€™d hoped. Sorry for the basic question, but how do I download data from reconciled items from Wikidata? I can at least compare it to the actual items.

You can use the ā€œAdd columns from reconciled valuesā€ menu option (in the column menu ā†’ Edit columns) to add the relevant information you want from Wikidata, then you can either keep it in OpenRefine, or you can export using the usual OpenRefine export options

1 Like

Thank you!! Iā€™m able to see that most of the artist IDs I cobbled together from first and last names match the actual IDs from matched items on Wikidata. I have another follow-up question. The artist IDs are important because they allow me to form the URL for the authorā€™s page in the database. I need this URL to be a source for information like birth year and birth place. Some of the URLs are malformed. Is there a way to automatically check and see which URLs give me a 404 error (Iā€™m thinking probably not, but it canā€™t hurt to ask)?

You can - if you use "add column by fetching URLsā€™ and check the option to ā€˜store errorsā€™ (see Column editing | OpenRefine) then any 404s will appear as errors in your added column, and any successful fetches should contain the HTML (or whatever format the pages are in) from the website

1 Like

one month later and I am still trying to figure out how to finish cleaning my data. I have two things that I want to do, and Iā€™m not sure how to do it. The way Iā€™ve imagined doing them is with an ā€œifā€ function, and I donā€™t know how that would transform into openrefine.

  1. some of the URLs didnā€™t fetch anything. I want to say ā€œif there is a URL-derived-from-name and nothing in its URL test column, delete that cellā€™s URLā€ (I was able to delete URLs with the 404 errors by faceting the URL test column by text and then deleting the matching ones. But I canā€™t do that for this instance).
  2. similarly, I have URLs from a second source, reconciled-URL. I want to merge it with the URL-derived-from-name, and prefer the reconciled-URL. To say this in a scripting language, I imagine Iā€™d say something like, if there is a reconciled-URL, go with that, else take the derived-from-name, else blank. Is this something I could use the existing ā€œJythonā€ library for?

There are several ways you could approach this but the simplest might be to use Facets.

Starting with the first case:

If you do a ā€˜facet by blankā€™ (see Exploring facets | OpenRefine) on the URL test column, that will give you all the rows where that column isnā€™t populated.

Iā€™m not completely sure of the project structure, but similarly you could do a facet on the ā€œURL-derived-from-nameā€ column

  • if the options are either URL or blank then you can just use ā€˜facet by blankā€™ again
  • if there is a more nuanced test (e.g. it can contain a URL or contain something else) you can do a ā€˜custom text facetā€™ (Exploring facets | OpenRefine) and in the expression write GREL like (just as an example): value.startsWith("http")

Once you have the two checks you need in a facet, you just select the right values in both facets, and the resulting rows will fulfil both conditions (e.g. one blank, one with a URL)

For the second case:

I have URLs from a second source, reconciled-URL. I want to merge it with the URL-derived-from-name, and prefer the reconciled-URL. To say this in a scripting language, I imagine Iā€™d say something like, if there is a reconciled-URL, go with that, else take the derived-from-name, else blank.

Then there are lots of options but assuming both these URLs are already in your project (I'm not clear?) you could do:

  1. Add new column based on your 'reconciled-URL' column, just to copy that into a new column
  2. Use Facet by blank on this new column to find the rows where there is no value
  3. Make sure only the rows with the blank cells are showing in the data grid and then do a Transformation on this column with GREL like cells["url-derived-from-name-column-name"].value

The last step will put in the value from your derived-from-name url column into the blank cells of your new column. And any remaining cells will still be blank

You can definitely achieve the same outcome with a GREL expression using conditional statements like if and with Jython as well - but I'd say the use of facets here is probably simpler.

I'm very happy to share some illustrations or examples or answer questions if you have them

1 Like

Although OpenRefine has an "if()" control, I usually recommend the use of facets as an easier to understand and debug option.

| Rachel_Helps
January 25 |

  • | - |

one month later and I am still trying to figure out how to finish cleaning my data. I have two things that I want to do, and Iā€™m not sure how to do it. The way Iā€™ve imagined doing them is with an ā€œifā€ function, and I donā€™t know how that would transform into openrefine.

  1. some of the URLs didnā€™t fetch anything. I want to say ā€œif there is a URL-derived-from-name and nothing in its URL test column, delete that cellā€™s URLā€ (I was able to delete URLs with the 404 errors by faceting the URL test column by text and then deleting the matching ones. But I canā€™t do that for this instance).

I would use "Facet by blank" (in the Customized facets menu) on the URL test column and then delete the contents of the URL-derived-from-name cells.

  1. similarly, I have URLs from a second source, reconciled-URL. I want to merge it with the URL-derived-from-name, and prefer the reconciled-URL. To say this in a scripting language, I imagine Iā€™d say something like, if there is a reconciled-URL, go with that, else take the derived-from-name, else blank. Is this something I could use the existing ā€œJythonā€ library for?

Make a copy of the reconciled-URL column (just in case), set up Facet by blank facets on both reconciled-URL-copy (=true) and derived-derived-from-name (=false) and copy the cell contents for all rows that match your filter.

Hope that helps!

Tom

1 Like

thank you both! I was able to figure out a way to facet by blank to fill in the URLs I wanted in the correct column.

1 Like