On Sat, Jul 18, 2009 at 3:20 PM, David Gerard <dgerard@gmail.com> wrote:
2009/7/18 Alexandre Dulaunoy <a@foo.be>:
I was wondering if it would be possible to allow web robots to access http://upload.wikimedia.org/wikipedia/commons/ to gather and mirror the media files. As this is pure HTTP, the mirroring could benefit from HTTP's object-caching mechanisms (instead of a large dump containing all the media files, which is more difficult to cache and update).
I see lots of files on upload.wikimedia.org on Google Image Search already. Is that actually forbidden by our robots.txt?
It'd actually be better if Google properly indexed the text pages whose names end in .jpg or whatever ... but they're aware we'd like that, so it's up to them.
But directory listing is currently disallowed on the upload directories, for example:
http://upload.wikimedia.org/wikipedia/commons/8/8c/
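
For what it's worth, those are two separate questions: whether robots.txt forbids crawling the path, and whether the web server is willing to generate an index for it at all. A minimal sketch (Python standard library only, nothing Wikimedia-specific) that checks both:

import urllib.error
import urllib.request
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("http://upload.wikimedia.org/robots.txt")
rp.read()

directory_url = "http://upload.wikimedia.org/wikipedia/commons/8/8c/"
print("robots.txt allows a generic crawler:", rp.can_fetch("*", directory_url))

# Even if robots.txt allows the path, the server may simply refuse to
# generate a directory index (typically HTTP 403).
try:
    with urllib.request.urlopen(directory_url) as resp:
        print("directory index HTTP status:", resp.status)
except urllib.error.HTTPError as err:
    print("directory index HTTP status:", err.code)
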
Of course, a bot could still get the media files by following links from the other pages, but this is neither handy nor efficient for making an exact mirror of just the current media file repository.
Would it be possible to enable directory listing for http://upload.wikimedia.org/wikipedia/commons and its subdirectories?
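
To illustrate what that would enable, here is a rough sketch of the kind of mirroring I have in mind. It assumes an Apache-style autoindex page and uses HTTP conditional requests (If-Modified-Since) so unchanged files are never re-downloaded; the href parsing is a deliberate simplification, not an official tool:

import email.utils
import os
import re
import urllib.error
import urllib.parse
import urllib.request

def mirror_directory(listing_url, dest_dir):
    """Mirror the plain files listed on one autoindex page into dest_dir."""
    os.makedirs(dest_dir, exist_ok=True)
    html = urllib.request.urlopen(listing_url).read().decode("utf-8", "replace")
    # Crude href extraction: skips parent-directory, subdirectory and sort links.
    for name in re.findall(r'href="([^"?/]+)"', html):
        local_path = os.path.join(dest_dir, urllib.parse.unquote(name))
        req = urllib.request.Request(listing_url + name)
        if os.path.exists(local_path):
            # Conditional GET: the server answers 304 if our copy is current.
            req.add_header("If-Modified-Since",
                           email.utils.formatdate(os.path.getmtime(local_path),
                                                  usegmt=True))
        try:
            with urllib.request.urlopen(req) as resp:
                with open(local_path, "wb") as out:
                    out.write(resp.read())
        except urllib.error.HTTPError as err:
            if err.code != 304:   # 304 Not Modified: keep the local copy
                raise

# Would only work once directory listing is enabled for that path:
mirror_directory("http://upload.wikimedia.org/wikipedia/commons/8/8c/",
                 "commons-mirror/8/8c")

A real mirror would also recurse into subdirectories and could honour ETag headers as well, but the point is that each media file stays an individually cacheable HTTP object.
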
Thanks for the feedback,