I have a column of Mid, and I want to extract the File:URL…

I have a column with a lot of Mids (e.g. M141414969).
How can I get the File URL (e.g. https://commons.wikimedia.org/wiki/File:MdM_Micheline_Legendre_en_1975.jpg) of the image for each Mid?

Regards, Antoine


Right now I'm not sure whether OpenRefine (OR) has a direct way to do this, similar to how it gets captions, labels, descriptions and properties from Wikibase. One workaround is to fetch the entity URL (e.g. https://commons.wikimedia.org/entity/M141414969); the data returned includes the title of the file, so we can construct the File URL from that.


I get a redirect (HTTP 301 or 302), but I'm not sure how to extract its target in OR…

Regards, Antoine

I think if you set the HTTP "Accept" header to application/json, it will return the JSON rather than a redirect.

Then maybe something like
value.parseJson().entities.get(cells["Img_WCID"].value).title
to extract the file name?
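
For anyone doing the same thing outside OR, here is a minimal Python sketch of that extraction. It assumes the JSON has the shape implied by the GREL expression above (entities, then the Mid, then title) and that the title normally already carries the "File:" prefix. The function name and the sample payload are made up for illustration; the sample is not a real API response.

```python
import json
from urllib.parse import quote


def file_url_from_entity_data(raw_json: str, mid: str) -> str:
    """Build the File page URL from an EntityData JSON document."""
    data = json.loads(raw_json)
    title = data["entities"][mid]["title"]
    # Assumption: the title usually includes the namespace; add it if absent.
    if not title.startswith("File:"):
        title = "File:" + title
    # Wiki page URLs use underscores instead of spaces.
    return "https://commons.wikimedia.org/wiki/" + quote(
        title.replace(" ", "_"), safe=":/"
    )


# Hand-written sample that only mimics the fields used above.
sample = json.dumps(
    {"entities": {"M141414969": {"title": "File:MdM Micheline Legendre en 1975.jpg"}}}
)
print(file_url_from_entity_data(sample, "M141414969"))
```

With the sample above this prints the File URL from the original question.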


Instead of using content negotiation headers, you can also just modify the URL:

https://commons.wikimedia.org/wiki/Special:EntityData/M141414969.json

although "title" seems like a very odd place to store the file name and I'm not sure how reliable it is.

Perhaps a better approach would be to look at the RDF/XML or Turtle

https://commons.wikimedia.org/wiki/Special:EntityData/M141414969.ttl

https://commons.wikimedia.org/wiki/Special:EntityData/M141414969.rdf

which has (in Turtle) all the attributes of the image object:

sdc:M141414969 a schema:MediaObject,
        schema:ImageObject ;
    schema:encodingFormat "image/jpeg" ;
    schema:contentUrl <https://upload.wikimedia.org/wikipedia/commons/2/2c/MdM_Micheline_Legendre_en_1975.jpg> ;
    schema:url <http://commons.wikimedia.org/wiki/Special:FilePath/MdM%20Micheline%20Legendre%20en%201975.jpg> ;
    schema:contentSize "18217"^^xsd:integer ;
    schema:height "140"^^xsd:integer ;
    schema:width "234"^^xsd:integer .
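
If you go the Turtle route, the file name can be recovered from the schema:url value. The following is only a rough Python sketch: it uses a regular expression instead of a real RDF parser (which would be more robust), assumes the schema:url line looks like the one above, and the function name is hypothetical.

```python
import re
from urllib.parse import unquote

# Shortened copy of the Turtle shown above, used as sample input.
TURTLE_SAMPLE = """
sdc:M141414969 a schema:MediaObject,
        schema:ImageObject ;
    schema:url <http://commons.wikimedia.org/wiki/Special:FilePath/MdM%20Micheline%20Legendre%20en%201975.jpg> ;
    schema:width "234"^^xsd:integer .
"""


def file_page_url_from_turtle(turtle: str) -> str:
    """Derive the File page URL from the schema:url triple."""
    match = re.search(r"schema:url\s+<([^>]+)>", turtle)
    if match is None:
        raise ValueError("no schema:url triple found")
    # Everything after Special:FilePath/ is the percent-encoded file name.
    encoded_name = match.group(1).split("Special:FilePath/", 1)[1]
    name = unquote(encoded_name).replace(" ", "_")
    # Note: the name is not re-encoded here; names with special
    # characters may need urllib.parse.quote before use.
    return "https://commons.wikimedia.org/wiki/File:" + name


print(file_page_url_from_turtle(TURTLE_SAMPLE))
```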


"title" in the EntityData should be fairly reliable for getting the image page associated with an Mid.

Since the Mid is generated from the pageId, you could also just shave off the M and query https://commons.wikimedia.org/w/api.php?action=query&format=json&pageids=141414969 to get the title value that way.
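
As a small illustration of "shaving off the M", here is a Python sketch that turns a Mid into the query URL above. The helper names are hypothetical, and the validation is an extra I added, not something the API requires.

```python
from urllib.parse import urlencode

API_URL = "https://commons.wikimedia.org/w/api.php"


def pageid_from_mid(mid: str) -> int:
    """Strip the leading 'M' from a Mid and return the page ID."""
    if not (mid.startswith("M") and mid[1:].isdigit()):
        raise ValueError(f"not a Mid: {mid!r}")
    return int(mid[1:])


def query_url(mid: str) -> str:
    """Build the action=query URL for a single Mid."""
    params = {
        "action": "query",
        "format": "json",
        "pageids": pageid_from_mid(mid),
    }
    return API_URL + "?" + urlencode(params)


print(query_url("M141414969"))
```

This prints the same URL as in the post above.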


@Andre_Costa: This seems fast. Do you know if I can query several IDs in the same call?

This is not working:

https://commons.wikimedia.org/w/api.php?action=query&format=json&pageids=141414969,141414970,141414968

Regards, Antoine

@Gnoeee: I dismissed your answer too fast. It was working, but it’s slow. Thanks.

Regards, Antoine

@tfmorris: the JSON seems easier to parse. It’s just slow.
Thanks for the alternative solutions.

Regards, Antoine

The IDs are separated using the pipe character,

e.g. https://commons.wikimedia.org/w/api.php?action=query&format=json&pageids=141414969|141414970|141414968

You are allowed a maximum of 50 per request (unless you are logged in with special permissions).
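
To stay within that limit for a long column, the Mids can be split into groups of 50 with one pipe-separated request per group. Below is a minimal Python sketch of that batching, with hypothetical function names. It only builds the URLs and reads a response that has already been parsed; it assumes the response has the usual query/pages layout keyed by page ID, and that pages without a title (e.g. missing ones) should be skipped.

```python
from urllib.parse import urlencode

API_URL = "https://commons.wikimedia.org/w/api.php"


def batched_query_urls(mids, batch_size=50):
    """Return one query URL per group of at most batch_size Mids."""
    page_ids = [mid[1:] for mid in mids]  # drop the leading 'M'
    urls = []
    for start in range(0, len(page_ids), batch_size):
        chunk = page_ids[start:start + batch_size]
        params = {
            "action": "query",
            "format": "json",
            "pageids": "|".join(chunk),  # pipe-separated list
        }
        # urlencode writes the pipe as %7C, which is its URL-encoded form.
        urls.append(API_URL + "?" + urlencode(params))
    return urls


def titles_by_mid(response):
    """Map each Mid to its title from a parsed API response."""
    pages = response["query"]["pages"]
    return {
        "M" + page_id: page["title"]
        for page_id, page in pages.items()
        if "title" in page  # assumption: skip entries without a title
    }


print(batched_query_urls(["M141414969", "M141414970", "M141414968"])[0])
```

Each URL can then be fetched with whatever HTTP client you prefer, and the parsed JSON passed to titles_by_mid.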