On Sep 23, 2013 9:25 AM, "Mihai Chintoanu" <mihai.chintoanu@skobbler.com> wrote:
> I have a list of about 1.8 million images which I have to download from commons.wikimedia.org. Is there any simple way to do this which doesn't involve an individual HTTP hit for each image?
You mean full size originals, not thumbs scaled to a certain size, right?
You should rsync from a mirror[0] (rsync allows specifying a list of files to copy) and then fill in the missing images from upload.wikimedia.org. For upload.wikimedia.org I'd say you should throttle yourself to one cache miss per second (you can check the headers on a response to see whether it was a hit or a miss, and back off when you get a miss), and you shouldn't use more than one or two simultaneous HTTP connections. In any case, make sure you have an accurate UA string with contact info (an email address) so ops can contact you if there's an issue.
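Something like this for the fill-in step (a rough Python sketch, stdlib only; the "X-Cache" header name and its hit/miss wording are an assumption from what Wikimedia's caches send today, so check a few real responses before relying on it):

```python
import time
import urllib.request

# Accurate UA with contact info so ops can reach you (address is a placeholder).
UA = "CommonsImageFetcher/1.0 (contact: you@example.com)"

def is_cache_hit(x_cache):
    """True if the X-Cache header looks like a cache hit, e.g. 'cp1063 hit (4)'."""
    return "hit" in (x_cache or "").lower()

def fetch(url, hit_delay=0.5, miss_delay=2.0):
    """Fetch one file; return (bytes, seconds to sleep before the next request).

    Back off harder after a miss, since a miss cost the origin real work.
    """
    req = urllib.request.Request(url, headers={"User-Agent": UA})
    with urllib.request.urlopen(req) as resp:
        data = resp.read()
        hit = is_cache_hit(resp.headers.get("X-Cache"))
    return data, (hit_delay if hit else miss_delay)

# Usage: data, pause = fetch(url); time.sleep(pause); ...next url...
```

Run it single-threaded (or with at most two connections, as above) and the sleep between requests gives you the throttling for free.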
At the moment there's only one mirror, and it's ~6-12 months out of date, so there may be a substantial amount to fill in. And of course you should be getting checksums from somewhere (the API?) and verifying them. If your images are all missing from the mirror, it should take around 40 days at 0.5 img/sec, but I'd guess you could do it in under 10 days if you have a fast enough pipe (it depends on whether you get a lot of misses or hits).
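For the verification part, the API reports a SHA-1 per file (via prop=imageinfo with iiprop=sha1, if I remember right; double-check that), so a sketch of the local side would be:

```python
import hashlib

def sha1_of_file(path, chunk_size=1 << 20):
    """Stream the file in chunks so large originals don't have to fit in memory."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def verify(path, expected_sha1):
    """Compare a downloaded file against the hex SHA-1 the API reported."""
    return sha1_of_file(path) == expected_sha1.lower()
```

Re-queue anything that fails verification rather than assuming a transient error fixed itself.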
See also [1], though not all of it applies because upload.wikimedia.org isn't MediaWiki, so e.g. there's no maxlag param.
-Jeremy
[0] https://meta.wikimedia.org/wiki/Mirroring_Wikimedia_project_XML_dumps#Media
[1] https://www.mediawiki.org/wiki/API:Etiquette