Dandelion API and categories

Padraic · December 2, 2022, 3:49pm

There’s a very interesting blog post by Graham Jevron from the British Library. It’s on how to process archival metadata with OpenRefine.

In it he uses the Dandelion API to extract named entities. This is straightforward enough to do; it just requires a free API token. However, he also categorises all the entities. See image here: https://blogs.bl.uk/.a/6a00d8341c464853ef025d9b488997200c-500wi

Can anyone explain how that is done? Was it done within OpenRefine or by the way the Dandelion API was queried? I can’t see anything in the Dandelion API documentation but I could well be missing something obvious.

antonin_d · December 3, 2022, 9:43am

I suspect the NER extension was used to do this extraction. Doing it manually in OpenRefine without the extension is likely possible but more involved.

tfmorris · December 4, 2022, 5:20pm

Can anyone explain how that is done? I can’t see anything in the Dandelion API documentation but I could well be missing something obvious.

You need to specify include=categories in your API call to get this information.
https://dandelion.eu/docs/api/datatxt/nex/v1/#parameters
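
As a rough illustration outside OpenRefine, here is a minimal Python sketch of a request URL carrying that parameter. The endpoint is assumed from the documentation linked above; `build_nex_url` is a hypothetical helper, and `YOUR_TOKEN` and the sample text are placeholders. It only builds the URL and does not call the API.

```python
from urllib.parse import urlencode, urlparse, parse_qs

# Assumed endpoint for the Dandelion entity extraction API (see the docs link above).
NEX_ENDPOINT = "https://api.dandelion.eu/datatxt/nex/v1"

def build_nex_url(text, token, include=("types", "categories")):
    """Hypothetical helper: build a request URL that asks for extra fields."""
    params = {
        "text": text,
        "include": ",".join(include),  # comma-separated list, e.g. "types,categories"
        "token": token,                # placeholder; use your own API token
    }
    return NEX_ENDPOINT + "?" + urlencode(params)

url = build_nex_url("Offices at Grand Canal Dock", "YOUR_TOKEN")
print(url)
```

In OpenRefine the same URL would typically be assembled in the "Add column by fetching URLs" expression instead.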

Tom

ostephens · December 4, 2022, 10:36pm

I have some examples of using Dandelion to do NER in OpenRefine in this set of exercises (Google Docs): "Self-paced exercises 1: Variables, operators, and getting data from online sources".

Hope it might help

Owen

Padraic · December 5, 2022, 2:38pm

That’s all very helpful, thanks folks.

@ostephens those exercises look great, thank you.

Padraic · December 13, 2022, 3:24pm

Hi @ostephens

Could I ask you about Exercise 5 in that document? I have used the GREL expression there to return People, Places, Companies, Buildings etc and it has been great.

I have been trying to modify the expression to return the spots for annotations that have no type, so that it picks up Dandelion results like this:

{
	"time": 1,
	"annotations": [{
		"start": 11,
		"end": 27,
		"spot": "Grand Canal Dock",
		"confidence": 0.8609,
		"id": 10425918,
		"title": "Grand Canal Dock",
		"uri": "http://en.wikipedia.org/wiki/Grand_Canal_Dock",
		"abstract": "The west inner basin and Boland\u0027s Mill, January 2022View of the western (inner) basin from the top floor of the Google Docks (Montevetro) building. Boland\u0027s Mill, the Alto Vetro building, and The Marker Hotel can be seen.Grand Canal Dock is a Southside area near the city centre of Dublin, Ireland. It is located on the border of eastern Dublin 2 and the westernmost part of Ringsend in Dublin 4, surrounding the Grand Canal Docks, an enclosed harbour where the Grand Canal comes to the River Liffey. The area has undergone significant redevelopment since 2000, as part of the Dublin Docklands area redevelopment project.",
		"label": "Grand Canal Dock",
		"categories": ["Dublin Docklands", "Office buildings in the Republic of Ireland", "Places in Dublin (city)", "Ringsend", "Skyscrapers in the Republic of Ireland"],
		"types": [],
		"lod": {
			"dbpedia": "http://dbpedia.org/resource/Grand_Canal_Dock",
			"wikipedia": "http://en.wikipedia.org/wiki/Grand_Canal_Dock"
		}
	}],
	"lang": "en",
	"timestamp": "2022-12-06T14:49:58.182"
}

But to no avail. I have tried variations on leaving the types entry empty

forEach(filter(value.parseJson().get("annotations"),a,a.get("types").inArray("")),p,p.get("spot"))

or using not(inArray("http://dbpedia.org/")), but I haven’t managed to get anything to work.

Would you have any suggestions on what might succeed?

ostephens · December 13, 2022, 4:32pm

You could look for the length being zero I think - so the filter part of your expression would be:

filter(value.parseJson().get("annotations"),a,a.get("types").length()==0)

Padraic · December 13, 2022, 4:52pm

That’s brilliant Owen, thanks!

For anyone else who might want the same query this works:

forEach(filter(value.parseJson().get("annotations"),a,a.get("types").length()==0),p,p.get("spot")).join("|")
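
For checking the logic outside OpenRefine, a minimal Python sketch of the same idea follows. The sample response is trimmed to the fields used, and the second annotation ("Dublin" with a type) is made up purely so that the filter has something to exclude. `untyped_spots` is a hypothetical helper name.

```python
import json

# Trimmed sample response. The "Dublin" annotation is invented for illustration.
response = '''{"annotations": [
  {"spot": "Grand Canal Dock", "types": []},
  {"spot": "Dublin", "types": ["http://dbpedia.org/ontology/Place"]}
]}'''

def untyped_spots(raw):
    """Return the spots of annotations whose "types" list is empty, joined by "|"."""
    annotations = json.loads(raw).get("annotations", [])
    # Same test as the GREL filter: keep annotations where types has length zero.
    return "|".join(a["spot"] for a in annotations if len(a.get("types", [])) == 0)

result = untyped_spots(response)
print(result)
```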

Padraic · December 14, 2022, 12:54pm

In these queries, is it possible to return two fields, say “spot” and “lod”?

This doesn’t work, for example:

forEach(filter(value.parseJson().get("annotations"),a,a.get("types").length()==0),p,p.get("spot"+"lod")).join("|")

I hope it’s ok to ask this here. I couldn’t find anything obvious on the GREL functions page.

b2m · December 14, 2022, 1:38pm

Well… you could, for example, use p.get("spot") + "|" + p.get("lod"). But I would not recommend that. I would recommend running the snippet twice: once for “spot” (maybe adding a new column?) and once for “lod” (adding another column?). Otherwise you end up with “lod” and “spot” mixed in one column and need to separate them.

ostephens · December 14, 2022, 3:09pm

I agree with @b2m . If you want to extract both values at once you can simply repeat the p.get with each field in your expression as necessary, but it may produce unexpected results as p.get doesn’t necessarily get a string (it could get a string, json object, array, number or boolean, as that’s what can be in any JSON element)

Padraic · December 15, 2022, 10:26am

Ideally I would have the spot and URI linked in one cell for now, which I would separate out later. I think that would be easier to work with than having one cell containing one or more spots and a second cell containing one or more URIs, and then trying to match them up afterwards. Though I might be overthinking this.

Thanks again for the help.

ostephens · December 16, 2022, 12:03pm

Hi @Padraic I think @b2m’s suggestion achieves what you are looking for. So for example:

forEach(filter(value.parseJson().get("annotations"),a,a.get("types").length()==0),p,p.get("spot")+"~"+p.get("lod")).join("|")

Note that in this example I use “~” as a separator between the Spot and the LOD. You could equally use “|” but then if you had repeated annotations it wouldn’t be possible to separate the spot/LOD pairs easily
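
To illustrate the two-separator idea outside OpenRefine, here is a minimal Python sketch. Since "lod" is a JSON object rather than a string, it picks one URI from it by key. The choice of "dbpedia" and the `spot_lod_pairs` helper name are assumptions for illustration; in GREL the corresponding step would presumably be p.get("lod").get("dbpedia").

```python
import json

# Sample response trimmed to the fields used below.
response = '''{"annotations": [
  {"spot": "Grand Canal Dock", "types": [],
   "lod": {"dbpedia": "http://dbpedia.org/resource/Grand_Canal_Dock",
           "wikipedia": "http://en.wikipedia.org/wiki/Grand_Canal_Dock"}}
]}'''

def spot_lod_pairs(raw, lod_key="dbpedia"):
    """Join spot and one LOD URI with "~", and join the pairs with "|"."""
    annotations = json.loads(raw).get("annotations", [])
    pairs = [
        a["spot"] + "~" + a["lod"][lod_key]
        for a in annotations
        if not a.get("types")  # only annotations with an empty types list
    ]
    return "|".join(pairs)

cell = spot_lod_pairs(response)
print(cell)

# Later, the cell can be split back into (spot, URI) pairs.
split_back = [tuple(pair.split("~", 1)) for pair in cell.split("|")]
print(split_back)
```

This relies on neither separator appearing inside a spot or a URI.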

Padraic · December 16, 2022, 3:14pm

That’s great, thank you @ostephens .
