I have a column of Mid, and I want to extract the File:URL…

Antoine2711 · May 17, 2024, 6:26pm

I have a column with a lot of Mids (e.g. M141414969).
How can I get the File URL (e.g. https://commons.wikimedia.org/wiki/File:MdM_Micheline_Legendre_en_1975.jpg) of that image?

Regards, Antoine

Gnoeee · May 18, 2024, 5:58am

I'm not sure there is currently a direct way to do this in OR, similar to fetching captions, labels, descriptions and properties from Wikibase. One option is to fetch the entity URL (e.g. https://commons.wikimedia.org/entity/M141414969); the data returned includes the title of the file, so we can construct the File URL from that.

Antoine2711 · May 21, 2024, 3:18am

I get a redirect (HTTP 301 or 302), but I'm not sure how I can extract that in OR…

Regards, Antoine

ostephens · May 21, 2024, 6:36pm

I think if you set the HTTP "Accept" header to application/json, it will return the JSON rather than redirect.

Then maybe something like
value.parseJson().entities.get(cells["Img_WCID"].value).title
to extract the file name?
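
For anyone doing this step outside OR, here is a minimal Python sketch of the same idea: read the title from the EntityData JSON and turn it into the file page URL. The sample payload is a hypothetical, trimmed-down response; only the entities → Mid → title path comes from the expression above, and the assumption that the title looks like "File:Name with spaces.jpg" should be checked against a real response.

```python
import json
from urllib.parse import quote

# Hypothetical, trimmed sample payload. Only the entities -> <Mid> -> title
# path is taken from the thread; the real response has more fields.
SAMPLE = '{"entities": {"M141414969": {"title": "File:MdM Micheline Legendre en 1975.jpg"}}}'


def file_page_url(entity_json: str, mid: str) -> str:
    """Build the Commons file page URL from an EntityData JSON string."""
    title = json.loads(entity_json)["entities"][mid]["title"]
    # Commons page URLs use underscores where the title has spaces.
    page_name = title.replace(" ", "_")
    # Keep ':' and '/' unescaped so the "File:" prefix stays readable.
    return "https://commons.wikimedia.org/wiki/" + quote(page_name, safe=":/")


print(file_page_url(SAMPLE, "M141414969"))
```

The function only parses text it is given, so the fetch itself (with the Accept header set) is left to whatever HTTP client you prefer.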

tfmorris · May 21, 2024, 8:59pm

Instead of using content negotiation headers, you can also just modify the URL:

https://commons.wikimedia.org/wiki/Special:EntityData/M141414969.json

although "title" seems like a very odd place to store the file name and I'm not sure how reliable it is.

Perhaps a better approach would be to look at the RDF/XML or Turtle

https://commons.wikimedia.org/wiki/Special:EntityData/M141414969.ttl

https://commons.wikimedia.org/wiki/Special:EntityData/M141414969.rdf

which has (in Turtle) all the attributes of the image object:

sdc:M141414969 a schema:MediaObject,
        schema:ImageObject ;
    schema:encodingFormat "image/jpeg" ;
    schema:contentUrl <https://upload.wikimedia.org/wikipedia/commons/2/2c/MdM_Micheline_Legendre_en_1975.jpg> ;
    schema:url <http://commons.wikimedia.org/wiki/Special:FilePath/MdM%20Micheline%20Legendre%20en%201975.jpg> ;
    schema:contentSize "18217"^^xsd:integer ;
    schema:height "140"^^xsd:integer ;
    schema:width "234"^^xsd:integer .
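
If you go the Turtle route, a rough Python sketch for pulling one of those URLs out of the text could look like the following. It uses a plain regular expression rather than a real Turtle parser, so it is only an illustration that works for simple `predicate <IRI>` lines like the ones above; the function name `turtle_iri` is made up for this example.

```python
import re

# Excerpt of the Turtle shown above, kept short for the example.
TURTLE = """sdc:M141414969 a schema:MediaObject,
        schema:ImageObject ;
    schema:contentUrl <https://upload.wikimedia.org/wikipedia/commons/2/2c/MdM_Micheline_Legendre_en_1975.jpg> ;
    schema:url <http://commons.wikimedia.org/wiki/Special:FilePath/MdM%20Micheline%20Legendre%20en%201975.jpg> .
"""


def turtle_iri(turtle: str, predicate: str):
    """Return the first <IRI> that follows the given predicate, or None.

    This is a regex shortcut, not a Turtle parser: it assumes the object
    is written as <...> directly after the predicate.
    """
    match = re.search(re.escape(predicate) + r"\s+<([^>]+)>", turtle)
    return match.group(1) if match else None


print(turtle_iri(TURTLE, "schema:contentUrl"))
```

For anything beyond a quick extraction, a proper RDF library would be the safer choice.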

Andre_Costa · May 22, 2024, 2:11pm

"title" in the EntityData should be fairly reliable for getting the image page associated with an Mid.

Since the Mid is generated from the pageId, you could also just shave off the M and query https://commons.wikimedia.org/w/api.php?action=query&format=json&pageids=141414969 to get the title value that way.
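
A small Python sketch of that approach, in case it helps: strip the M, build the query URL, and read the title out of the response. The sample response is hypothetical and trimmed; the query → pages → pageid → title layout is what I would expect from format=json, but treat it as an assumption and check it against a live call.

```python
import json
from urllib.parse import urlencode

API = "https://commons.wikimedia.org/w/api.php"

# Hypothetical, trimmed sample of what the query is assumed to return.
SAMPLE_RESPONSE = (
    '{"query": {"pages": {"141414969": {"pageid": 141414969, "ns": 6, '
    '"title": "File:MdM Micheline Legendre en 1975.jpg"}}}}'
)


def page_id(mid: str) -> str:
    """Turn an Mid such as 'M141414969' into the page id '141414969'."""
    if not (mid.startswith("M") and mid[1:].isdigit()):
        raise ValueError("not an Mid: " + repr(mid))
    return mid[1:]


def query_url(mid: str) -> str:
    """Build the action=query URL for a single Mid."""
    params = {"action": "query", "format": "json", "pageids": page_id(mid)}
    return API + "?" + urlencode(params)


def title_for(response_json: str, mid: str) -> str:
    """Read the page title for the given Mid from the JSON response."""
    return json.loads(response_json)["query"]["pages"][page_id(mid)]["title"]


print(query_url("M141414969"))
print(title_for(SAMPLE_RESPONSE, "M141414969"))
```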

Antoine2711 · May 23, 2024, 4:33am

@Andre_Costa: This seems fast. Do you know if I can query many on the same call?

This: https://commons.wikimedia.org/w/api.php?action=query&format=json&pageids=141414969,141414970,141414968

Is not working.

Regards, Antoine

Antoine2711 · May 23, 2024, 4:34am

@Gnoeee: I dismissed your answer too fast. It was working, but it’s slow. Thanks.

Regards, Antoine

Antoine2711 · May 23, 2024, 4:38am

@tfmorris: the Json seems easier to parse. It’s just slow.
Thanks for the alternative solutions.

Regards, Antoine

Andre_Costa · May 23, 2024, 10:48am

The separation is done using the pipe character (|),

e.g. https://commons.wikimedia.org/w/api.php?action=query&format=json&pageids=141414969|141414970|141414968

You are allowed a maximum of 50 per request (unless you are logged in with special permissions).
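
To illustrate the batching, here is a hedged Python sketch that splits a list of Mids into groups of at most 50 and builds one pipe-separated query URL per group. The function name and the constant are made up for this example, and `urlencode` writes the pipe as %7C, which is simply its URL-encoded form.

```python
from urllib.parse import urlencode

API = "https://commons.wikimedia.org/w/api.php"
MAX_IDS_PER_REQUEST = 50  # limit mentioned above for ordinary users


def batched_query_urls(mids, batch_size=MAX_IDS_PER_REQUEST):
    """Return one query URL per batch of at most `batch_size` Mids."""
    # Drop the leading "M" to get the page ids.
    ids = [mid[1:] if mid.startswith("M") else mid for mid in mids]
    urls = []
    for start in range(0, len(ids), batch_size):
        batch = ids[start:start + batch_size]
        params = {
            "action": "query",
            "format": "json",
            "pageids": "|".join(batch),  # encoded as %7C in the URL
        }
        urls.append(API + "?" + urlencode(params))
    return urls


for url in batched_query_urls(["M141414969", "M141414970", "M141414968"]):
    print(url)
```

With 120 Mids this yields three URLs (50, 50 and 20 ids), so a large column needs far fewer calls than one request per row.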
