Hi all,
[Sorry, I'm new to the list, so I'm not sure if a similar discussion
has happened before or if these questions appear naive.]
I am working with a master's student and another colleague on Wikimedia
image data. The idea is to combine the meta-data and some descriptors
computed from the content of the images in Wikimedia with the structured
data of DBpedia/Wikidata to (hopefully) create a semantic search service
over these images.
The goal would ultimately be to enable queries such as "give me images
of cathedrals in Europe" or "give me images where an Iraqi politician
met an American politician" or "give me pairs of similar images where
the first image is of a Spanish national football player and the
second image is of somebody else". Such queries would be executed based
on a combination of structured data from DBpedia/Wikidata and standard
image descriptors (used, e.g., for searching for similar images).
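To make the first example a little more concrete, the "structured"
half of such a query (ignoring the image descriptors for now) might
look roughly as follows over the public Wikidata SPARQL endpoint; the
specific Wikidata identifiers used here (for cathedral, Europe, the
image property, etc.) are our guesses and would still need checking:

# Rough sketch: the structured half of "give me images of cathedrals in
# Europe" against the public Wikidata SPARQL endpoint. The identifiers
# (Q2977 cathedral, Q46 Europe, P31/P279/P17/P30/P18) are assumptions.
import requests

SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"

QUERY = """
SELECT ?cathedral ?image WHERE {
  ?cathedral wdt:P31/wdt:P279* wd:Q2977 ;  # instance of (a subclass of) cathedral
             wdt:P17 ?country ;            # country the cathedral is in
             wdt:P18 ?image .              # image (a Commons file)
  ?country wdt:P30 wd:Q46 .                # that country is on the European continent
}
LIMIT 100
"""

resp = requests.get(SPARQL_ENDPOINT,
                    params={"query": QUERY, "format": "json"},
                    headers={"User-Agent": "imgpedia-sketch/0.1"})
resp.raise_for_status()
for row in resp.json()["results"]["bindings"]:
    print(row["cathedral"]["value"], row["image"]["value"])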
The goal is ambitious but, from our side, nothing looks infeasible. If
you are interested, a sketch of some of the more technical details of
our idea is given in this short workshop paper:
http://aidanhogan.com/docs/imgpedia_amw2015.pdf
In any case, for this project, we would need to get the meta-data and
the image content itself for as many of the Wikimedia images linked from
Wikipedia as possible. So our questions would be:
* How many images are we talking about in Wikimedia (considering most
recent version, for example)?
* How many are linked from Wikipedia (e.g., English, any language)?
* What overall on-disk size would those images be?
* What would be the best way to access/download those images in bulk?
* How could we get the meta-data as well?
Any answers or hints on where to look would be great.
From our own searches, it seems the number of Wikimedia images is
around 23 million and the number used on Wikipedia (all languages) is
around 6 million, so we're talking about a ball-park of maybe 10
terabytes of raw image content (i.e., an average of roughly 1.5-2 MB
per original file)? We know we can extract a list of the relevant
Wikimedia images from the Wikipedia dump. In terms of getting image
content and meta-data in bulk, crawling is not a great option for
obvious reasons. The possible options we found mentioned on the Web
were:
1. The following mirror for rsyncing image data:
http://ftpmirror.your.org/pub/wikimedia/images/
2. The Allimages API to get some meta-data for the images (but not the
image content itself):
https://www.mediawiki.org/wiki/API:Allimages
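For reference, the kind of bulk paging we have in mind for 2. is
sketched below (rough code only; the property list, limits and
continuation handling would obviously need tuning):

# Rough sketch: page through Commons files with list=allimages, pulling
# some basic meta-data (URL, size, MIME type, SHA1) for each one.
import requests

API = "https://commons.wikimedia.org/w/api.php"

def iter_all_images():
    params = {
        "action": "query",
        "list": "allimages",
        "aiprop": "url|size|mime|sha1|timestamp",
        "ailimit": "500",      # maximum per request for normal accounts
        "format": "json",
        "continue": "",        # opt in to the newer continuation style
    }
    while True:
        data = requests.get(API, params=params,
                            headers={"User-Agent": "imgpedia-sketch/0.1"}).json()
        for img in data["query"]["allimages"]:
            yield img          # dict with 'name', 'url', 'size', 'mime', ...
        if "continue" not in data:
            break
        params.update(data["continue"])    # carries 'aicontinue' forward

for i, img in enumerate(iter_all_images()):
    print(img["name"], img["url"], img.get("size"))
    if i >= 9:                 # only print the first ten in this sketch
        break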
So the idea we are looking at right now is to get the images from 1.
and then try to match them with the meta-data from 2. Would this make
the most sense? (A rough sketch of the kind of matching we have in mind
is included below.) Also, the only documentation for 1. we could find
was:
https://meta.wikimedia.org/wiki/Mirroring_Wikimedia_project_XML_dumps#Media
Is there more of a description on how the folder structure is organised
and how, e.g., to figure out the URL of each image?
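Just so the matching idea between 1. and 2. is concrete, per file we
imagine something roughly like the following; it assumes the on-disk
name on the mirror corresponds to the Commons file title (with
underscores for spaces), which is exactly the kind of thing we would
like to confirm:

# Rough sketch: given a file name taken from the local mirror, look up
# its meta-data on Commons via prop=imageinfo. The example path below
# is hypothetical.
import os
import requests

API = "https://commons.wikimedia.org/w/api.php"

def image_metadata(local_path):
    title = "File:" + os.path.basename(local_path)
    params = {
        "action": "query",
        "titles": title,
        "prop": "imageinfo",
        "iiprop": "url|size|mime|sha1|extmetadata",
        "format": "json",
    }
    data = requests.get(API, params=params,
                        headers={"User-Agent": "imgpedia-sketch/0.1"}).json()
    page = next(iter(data["query"]["pages"].values()))  # single title queried
    if "imageinfo" not in page:    # file not found on Commons
        return None
    return page["imageinfo"][0]    # dict with 'url', 'size', 'mime', ...

info = image_metadata("wikipedia/commons/a/ab/Berlin_Cathedral.jpg")
if info:
    print(info["url"], info["mime"], info["size"])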
Any hints or feedback would be great.
Best/thanks,
Aidan