Dandelion API and categories

There’s a very interesting blog post by Graham Jevron from the British Library. It’s on how to process archival metadata with OpenRefine.

In it he uses the Dandelion API to extract named entities. This is straightforward enough to do; it just requires a free API token. However, he also categorises all the entities. See image here: https://blogs.bl.uk/.a/6a00d8341c464853ef025d9b488997200c-500wi

Can anyone explain how that is done? Was it done within OpenRefine or by the way the Dandelion API was queried? I can’t see anything in the Dandelion API documentation but I could well be missing something obvious.

I suspect the NER extension was used to do this extraction. Doing it manually in OpenRefine without the extension is likely possible but more involved.

You need to specify include=categories in your API call to get this information.
https://dandelion.eu/docs/api/datatxt/nex/v1/#parameters
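
As a rough sketch of what such a call might look like outside OpenRefine, the Python below only builds the request URL and does not send anything. The endpoint and the text, include and token parameters follow the documentation linked above as I understand it; the helper name build_nex_url and the placeholder token are assumptions for illustration.

```python
from urllib.parse import urlencode

# Entity extraction endpoint, as described in the Dandelion docs linked above.
NEX_ENDPOINT = "https://api.dandelion.eu/datatxt/nex/v1/"

def build_nex_url(text, token, include=("types", "categories")):
    """Build an entity-extraction request URL (hypothetical helper)."""
    params = {
        "text": text,
        "include": ",".join(include),  # comma-separated list of extra fields
        "token": token,                # replace with your own API token
    }
    return NEX_ENDPOINT + "?" + urlencode(params)

url = build_nex_url("Grand Canal Dock in Dublin", "YOUR_TOKEN")
print(url)
```

In OpenRefine the same URL could be assembled in a GREL expression when fetching URLs into a new column, with the cell value escaped into the text parameter.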

Tom

I have some examples of using Dandelion to do NER in OpenRefine in this set of exercises: "Self-paced exercises 1: Variables, operators, and getting data from online sources" (Google Docs).

Hope it might help

Owen

That’s all very helpful, thanks folks.

@ostephens those exercises look great, thank you.

Hi @ostephens

Could I ask you about Exercise 5 in that document? I have used the GREL expression there to return People, Places, Companies, Buildings etc and it has been great.

I have been trying to modify the expression to return the spots for annotations that have no type, so that it returns the spots for Dandelion results like this:

{
	"time": 1,
	"annotations": [{
		"start": 11,
		"end": 27,
		"spot": "Grand Canal Dock",
		"confidence": 0.8609,
		"id": 10425918,
		"title": "Grand Canal Dock",
		"uri": "http://en.wikipedia.org/wiki/Grand_Canal_Dock",
		"abstract": "The west inner basin and Boland\u0027s Mill, January 2022View of the western (inner) basin from the top floor of the Google Docks (Montevetro) building. Boland\u0027s Mill, the Alto Vetro building, and The Marker Hotel can be seen.Grand Canal Dock is a Southside area near the city centre of Dublin, Ireland. It is located on the border of eastern Dublin 2 and the westernmost part of Ringsend in Dublin 4, surrounding the Grand Canal Docks, an enclosed harbour where the Grand Canal comes to the River Liffey. The area has undergone significant redevelopment since 2000, as part of the Dublin Docklands area redevelopment project.",
		"label": "Grand Canal Dock",
		"categories": ["Dublin Docklands", "Office buildings in the Republic of Ireland", "Places in Dublin (city)", "Ringsend", "Skyscrapers in the Republic of Ireland"],
		"types": [],
		"lod": {
			"dbpedia": "http://dbpedia.org/resource/Grand_Canal_Dock",
			"wikipedia": "http://en.wikipedia.org/wiki/Grand_Canal_Dock"
		}
	}],
	"lang": "en",
	"timestamp": "2022-12-06T14:49:58.182"
}

But to no avail. I have tried variations on leaving the types entry empty

forEach(filter(value.parseJson().get("annotations"),a,a.get("types").inArray("")),p,p.get("spot"))

or using not(inArray("http://dbpedia.org/")), but I haven’t managed to get anything to work.

Would you have any suggestions on what might succeed?

You could look for the length being zero I think - so the filter part of your expression would be:

filter(value.parseJson().get("annotations"),a,a.get("types").length()==0)

That’s brilliant Owen, thanks!

For anyone else who might want the same query this works:

forEach(filter(value.parseJson().get("annotations"),a,a.get("types").length()==0),p,p.get("spot")).join("|")
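
For anyone checking the logic outside OpenRefine, here is a minimal Python sketch of the same filter. It also hints at why the inArray("") attempt returned nothing: an empty "types" list contains no elements at all, not even an empty string, so testing its length is the reliable check. The trimmed sample response and the function name untyped_spots are made up for illustration; the second annotation is invented purely for contrast.

```python
import json

# Trimmed-down response in the shape shown earlier in the thread.
# The second annotation is invented so the filter has something to exclude.
response = """
{"annotations": [
  {"spot": "Grand Canal Dock", "types": []},
  {"spot": "Dublin", "types": ["http://dbpedia.org/ontology/Place"]}
]}
"""

def untyped_spots(raw_json):
    """Return the spots of annotations whose "types" list is empty."""
    annotations = json.loads(raw_json).get("annotations", [])
    return [a["spot"] for a in annotations if len(a.get("types", [])) == 0]

# Join with "|" as in the GREL expression above.
print("|".join(untyped_spots(response)))
```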

In these queries, is it possible to return two fields, say “spot” and “lod”?

This doesn’t work, for example:

forEach(filter(value.parseJson().get("annotations"),a,a.get("types").length()==0),p,p.get("spot"+"lod")).join("|")

I hope it’s ok to ask this here. I couldn’t find anything obvious on the GREL functions page.

Well… you could, for example, by executing p.get("spot") + "|" + p.get("lod"). But I would not recommend that. I would recommend running the snippet twice: once for “spot” (maybe adding a new column?) and once for “lod” (adding another column?). Otherwise you end up with “lod” and “spot” mixed in one column and need to separate them.

I agree with @b2m. If you want to extract both values at once, you can simply repeat the p.get with each field in your expression as necessary, but it may produce unexpected results because p.get doesn’t necessarily return a string (it could return a string, JSON object, array, number or boolean, as that’s what can be in any JSON element).

Ideally I would have the spot and URI linked in one cell for now, which I would separate out later. I think that would be easier to work with than having a cell containing one or more spots and a second cell containing one or more URIs, and then trying to match them up afterwards. Though I might be overthinking this.

Thanks again for the help.

Hi @Padraic I think @b2m’s suggestion achieves what you are looking for. So for example:

forEach(filter(value.parseJson().get("annotations"),a,a.get("types").length()==0),p,p.get("spot")+"~"+p.get("lod")).join("|")

Note that in this example I use “~” as a separator between the spot and the LOD. You could equally use “|”, but then if you had repeated annotations it wouldn’t be possible to separate the spot/LOD pairs easily.
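
To illustrate the two-separator idea, here is a hedged Python sketch of the same pairing. In the sample JSON earlier in the thread, "lod" is an object with "dbpedia" and "wikipedia" keys rather than a plain string, so this sketch picks a single URI from it; the choice of "dbpedia" as the default, the function name spot_uri_pairs and the trimmed sample response are all assumptions for illustration.

```python
import json

# Trimmed-down response based on the sample JSON earlier in the thread.
response = """
{"annotations": [
  {"spot": "Grand Canal Dock", "types": [],
   "lod": {"dbpedia": "http://dbpedia.org/resource/Grand_Canal_Dock",
           "wikipedia": "http://en.wikipedia.org/wiki/Grand_Canal_Dock"}}
]}
"""

def spot_uri_pairs(raw_json, source="dbpedia", pair_sep="~", item_sep="|"):
    """Join spot and URI with pair_sep, and the resulting pairs with item_sep."""
    pairs = []
    for a in json.loads(raw_json).get("annotations", []):
        if a.get("types"):  # skip annotations that do have types
            continue
        uri = a.get("lod", {}).get(source, "")  # one URI out of the lod object
        pairs.append(a["spot"] + pair_sep + uri)
    return item_sep.join(pairs)

cell = spot_uri_pairs(response)
print(cell)

# Because the two separators differ, the cell can be split back into pairs.
recovered = [item.split("~", 1) for item in cell.split("|")]
```

Keeping the pair separator distinct from the item separator is what makes the later split unambiguous, provided neither character occurs in the spots or URIs themselves.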

That’s great, thank you @ostephens.