Hi all,
[Sorry, I'm new to the list, so I'm not sure if a similar discussion has happened before or if these questions seem naive.]
I am working with a master's student and other colleagues on Wikimedia image data. The idea is to combine the meta-data and some descriptors computed from the content of the images in Wikimedia with the structured data of DBpedia/Wikidata to (hopefully) create a semantic search service over these images.
The goal would ultimately be to enable queries such as "give me images of cathedrals in Europe" or "give me images where an Iraqi politician met an American politician" or "give me pairs of similar images where the first image is of a Spanish national football player and the second image is of somebody else". These queries would be executed based on a combination of structured data from DBpedia/Wikidata and standard image descriptors (used, e.g., for searching for similar images).
The goal is ambitious, but from our side nothing looks infeasible. If you are interested, a sketch of some of the more technical details of our idea is given in this short workshop paper:
http://aidanhogan.com/docs/imgpedia_amw2015.pdf
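To make the first of those example queries a bit more concrete: we imagine the "structured" half of "give me images of cathedrals in Europe" could already be approximated against the public Wikidata SPARQL endpoint along the lines of the sketch below (the identifiers used, e.g. Q2977 for cathedral and P18 for image, are just our guesses at the relevant vocabulary, and the image-descriptor/similarity side is of course what we would be adding on top):

# Rough sketch only: the structured part of "images of cathedrals in Europe",
# run against the public Wikidata SPARQL endpoint. Identifiers are our guesses:
# Q2977 = cathedral, P31 = instance of, P279 = subclass of, P17 = country,
# P30 = continent, Q46 = Europe, P18 = image (on Commons).
import requests

QUERY = """
SELECT ?item ?itemLabel ?image WHERE {
  ?item wdt:P31/wdt:P279* wd:Q2977 .   # a cathedral (or subclass thereof)
  ?item wdt:P17/wdt:P30 wd:Q46 .       # in a country on the European continent
  ?item wdt:P18 ?image .               # linked to an image on Commons
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
LIMIT 20
"""

resp = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": QUERY, "format": "json"},
    headers={"User-Agent": "imgpedia-sketch/0.1 (experiment)"},
)
resp.raise_for_status()
for row in resp.json()["results"]["bindings"]:
    print(row["itemLabel"]["value"], row["image"]["value"])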
In any case, for this project, we would need to get the meta-data and the image content itself for as many of the Wikimedia images linked from Wikipedia as possible. So our questions would be:
* How many images are we talking about in Wikimedia (considering the most recent version, for example)?
* How many are linked from Wikipedia (e.g., English, any language)?
* What overall on-disk size would those images be?
* What would be the best way to access/download those images in bulk?
* How could we get the meta-data as well?
Any answers or hints on where to look would be great.
From our own searches, it seems the number of Wikimedia images is around 23 million and the number used on Wikipedia (all languages) is around 6 million, so we're talking about a ball-park of maybe 10 terabytes of raw image content? We know we can extract a list of the relevant Wikimedia images from the Wikipedia dump. In terms of getting image content and meta-data in bulk, crawling is not a great option for obvious reasons. The possible options we found mentioned on the Web were:
1. The following mirror for rsyncing image data: http://ftpmirror.your.org/pub/wikimedia/images/
2. The All Images API to get some meta-data for images (but not the content). https://www.mediawiki.org/wiki/API:Allimages
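For 2., the rough sketch we have in mind for harvesting the meta-data (keyed by file name, so it could later be joined with whatever we pull down from 1.) is roughly the following; the properties requested in 'aiprop' are just a first guess at what we would need:

# Sketch of iterating over image meta-data via list=allimages on Commons,
# using standard API continuation; the 'aiprop' selection is a first guess.
import requests

API = "https://commons.wikimedia.org/w/api.php"

def iter_allimages(batch_size=500):
    params = {
        "action": "query",
        "format": "json",
        "list": "allimages",
        "aiprop": "url|size|mime|sha1|timestamp",
        "ailimit": batch_size,
        "continue": "",
    }
    while True:
        data = requests.get(API, params=params).json()
        for img in data["query"]["allimages"]:
            yield img
        if "continue" not in data:
            break
        params.update(data["continue"])

# e.g. build a (partial) file-name -> meta-data map; a full run over ~23M
# images would of course mean tens of thousands of requests, so a dump-based
# source for this meta-data would be much preferable if one exists.
meta = {}
for i, img in enumerate(iter_allimages()):
    meta[img["name"]] = img
    if i >= 1000:
        break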
So the idea we are looking at right now is to get the images from 1. and then try to match them with the meta-data from 2. Would this make the most sense? Also, the only documentation for 1. that we could find was:
https://meta.wikimedia.org/wiki/Mirroring_Wikimedia_project_XML_dumps#Media
Is there a more detailed description of how the folder structure is organised and of how, e.g., to figure out the URL of each image?
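For what it's worth, our current working assumption is that the layout mirrors that of upload.wikimedia.org: spaces in the file name are replaced by underscores, the name is hashed with MD5, and the first one and first two hex characters of the digest give the two directory levels, i.e. something like:

# Our guess at the path convention (we also assume file names are stored with
# the first letter capitalised, as per normal MediaWiki title normalisation).
import hashlib

def commons_path(file_name):
    name = file_name.replace(" ", "_")
    digest = hashlib.md5(name.encode("utf-8")).hexdigest()
    return "{}/{}/{}".format(digest[0], digest[:2], name)

# Usage: commons_path("Some file name.jpg") -> "x/xy/Some_file_name.jpg",
# to be appended to the mirror's .../wikipedia/commons/ prefix (we assume).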
Any hints or feedback would be great.
Best/thanks, Aidan