Hi all,
[Sorry, I'm new to the list, so I'm not sure if a similar discussion
has happened before or if these questions appear naive.]
I am working with a master's student and another colleague on Wikimedia
image data. The idea is to combine the meta-data and some descriptors
computed from the content of the images in Wikimedia with the structured
data of DBpedia/Wikidata to (hopefully) create a semantic search service
over these images.
The goal would ultimately be to enable queries such as "give me images
of cathedrals in Europe" or "give me images where an Iraqi politician
met an American politician" or "give me pairs of similar images where
the first image is of a Spanish national football player and the
second image is of somebody else". Such queries would be executed based
on a combination of structured data from DBpedia/Wikidata and standard
image descriptors (used, e.g., for searching for similar images).
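To make the first example a little more concrete, the "structured"
half of such a query (ignoring the image descriptors for now) might
look roughly as follows over the public Wikidata SPARQL endpoint; the
specific Wikidata identifiers used here (for cathedral, Europe, the
image property, etc.) are our guesses and would still need checking:

# Rough sketch: the structured half of "give me images of cathedrals in
# Europe" against the public Wikidata SPARQL endpoint. The identifiers
# (Q2977 cathedral, Q46 Europe, P31/P279/P17/P30/P18) are assumptions.
import requests

SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"

QUERY = """
SELECT ?cathedral ?image WHERE {
  ?cathedral wdt:P31/wdt:P279* wd:Q2977 ;  # instance of (a subclass of) cathedral
             wdt:P17 ?country ;            # country the cathedral is in
             wdt:P18 ?image .              # image (a Commons file)
  ?country wdt:P30 wd:Q46 .                # that country is on the European continent
}
LIMIT 100
"""

resp = requests.get(SPARQL_ENDPOINT,
                    params={"query": QUERY, "format": "json"},
                    headers={"User-Agent": "imgpedia-sketch/0.1"})
resp.raise_for_status()
for row in resp.json()["results"]["bindings"]:
    print(row["cathedral"]["value"], row["image"]["value"])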
The goal is ambitious but, from our side, nothing looks infeasible. If
you are interested, a sketch of some of the more technical details of
our idea is given in this short workshop paper:
http://aidanhogan.com/docs/imgpedia_amw2015.pdf
In any case, for this project, we would need to get the meta-data and
the image content itself for as many of the Wikimedia images linked from
Wikipedia as possible. So our questions would be:
* How many images are we talking about in Wikimedia (considering most
recent version, for example)?
* How many are linked from Wikipedia (e.g., English, any language)?
* What overall on-disk size would those images be?
* What would be the best way to access/download those images in bulk?
* How could we get the meta-data as well?
Any answers or hints on where to look would be great.
From our own searches, it seems the number of Wikimedia images is
around 23 million and the number used on Wikipedia (all languages) is
around 6 million, so we're talking about a ball-park of maybe 10
terabytes of raw image content (i.e., an average of roughly 1.5-2 MB
per original file)? We know we can extract a list of the relevant
Wikimedia images from the Wikipedia dump. In terms of getting image
content and meta-data in bulk, crawling is not a great option for
obvious reasons. The possible options we found mentioned on the Web
were:
1. The following mirror for rsyncing image data:
http://ftpmirror.your.org/pub/wikimedia/images/
2. The Allimages API to get some meta-data for the images (but not the
image content itself):
https://www.mediawiki.org/wiki/API:Allimages
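For reference, the kind of bulk paging we have in mind for 2. is
sketched below (rough code only; the property list, limits and
continuation handling would obviously need tuning):

# Rough sketch: page through Commons files with list=allimages, pulling
# some basic meta-data (URL, size, MIME type, SHA1) for each one.
import requests

API = "https://commons.wikimedia.org/w/api.php"

def iter_all_images():
    params = {
        "action": "query",
        "list": "allimages",
        "aiprop": "url|size|mime|sha1|timestamp",
        "ailimit": "500",      # maximum per request for normal accounts
        "format": "json",
        "continue": "",        # opt in to the newer continuation style
    }
    while True:
        data = requests.get(API, params=params,
                            headers={"User-Agent": "imgpedia-sketch/0.1"}).json()
        for img in data["query"]["allimages"]:
            yield img          # dict with 'name', 'url', 'size', 'mime', ...
        if "continue" not in data:
            break
        params.update(data["continue"])    # carries 'aicontinue' forward

for i, img in enumerate(iter_all_images()):
    print(img["name"], img["url"], img.get("size"))
    if i >= 9:                 # only print the first ten in this sketch
        break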
So the idea we are looking at right now is to get the images from 1.
and then try to match them with the meta-data from 2. Would this make
the most sense? (A rough sketch of the kind of matching we have in mind
is included below.) Also, the only documentation for 1. we could find
was:
https://meta.wikimedia.org/wiki/Mirroring_Wikimedia_project_XML_dumps#Media
Is there more of a description on how the folder structure is organised
and how, e.g., to figure out the URL of each image?
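Just so the matching idea between 1. and 2. is concrete, per file we
imagine something roughly like the following; it assumes the on-disk
name on the mirror corresponds to the Commons file title (with
underscores for spaces), which is exactly the kind of thing we would
like to confirm:

# Rough sketch: given a file name taken from the local mirror, look up
# its meta-data on Commons via prop=imageinfo. The example path below
# is hypothetical.
import os
import requests

API = "https://commons.wikimedia.org/w/api.php"

def image_metadata(local_path):
    title = "File:" + os.path.basename(local_path)
    params = {
        "action": "query",
        "titles": title,
        "prop": "imageinfo",
        "iiprop": "url|size|mime|sha1|extmetadata",
        "format": "json",
    }
    data = requests.get(API, params=params,
                        headers={"User-Agent": "imgpedia-sketch/0.1"}).json()
    page = next(iter(data["query"]["pages"].values()))  # single title queried
    if "imageinfo" not in page:    # file not found on Commons
        return None
    return page["imageinfo"][0]    # dict with 'url', 'size', 'mime', ...

info = image_metadata("wikipedia/commons/a/ab/Berlin_Cathedral.jpg")
if info:
    print(info["url"], info["mime"], info["size"])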
Any hints or feedback would be great.
Best/thanks,
Aidan