On 8/10/07, Minute Electron minuteelectron@googlemail.com wrote:
A tool has been created called Wikix; it downloads all images from a wiki.
I'm not aware of any method to check the validity of non-deleted* files after downloading them via HTTP, beyond checking the size and hoping the file isn't corrupted, or downloading each file more than once and comparing the copies.
When you're talking about very nearly 1 TB of images fetched via 1.75 million HTTP requests (the current size of Commons), corruption is a real issue if you care about getting a good copy. Errors that leave the size intact are quite possible, and fetching every file twice isn't a sane option for that much data.
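As a rough sketch of what a plain HTTP fetch can actually verify (the URL and helper name here are hypothetical, not part of Wikix or any existing tool), about the best you can do is compare the bytes received against the Content-Length header, which still misses any error that preserves the size:

import urllib.request

def fetch_and_check_size(url, dest):
    """Download url to dest and compare byte count to Content-Length.

    This only catches truncated transfers; corruption that leaves the
    size unchanged passes undetected.
    """
    with urllib.request.urlopen(url) as resp:
        expected = resp.headers.get("Content-Length")
        data = resp.read()
    with open(dest, "wb") as f:
        f.write(data)
    if expected is not None and int(expected) != len(data):
        raise IOError("size mismatch for %s: got %d bytes, expected %s"
                      % (url, len(data), expected))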
I'm not aware of any efforts to download all of Commons via HTTP. Jeff Merkey previously downloaded the images that the English Wikipedia uses, but that's only part of a much larger collection.
I don't believe moving that much data is really a major issue in itself, at least for the sort of people who have the storage around to handle it. Back when I downloaded the old Commons image dump (about 300 GB) that we had posted, the transfer took four days, which I don't consider a big deal at all.
*deleted files are renamed to the SHA1 of their content, so it's easy to check their transfer validity. I wish non-deleted images behaved in the same way.
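For the deleted files that check is trivial; a minimal sketch, assuming the on-disk name (minus extension) is the hex SHA-1 digest (the actual encoding MediaWiki uses on disk may differ):

import hashlib
import os

def verify_deleted_file(path):
    """Check that a file's content hashes to the SHA-1 in its name.

    Assumes the filename without its extension is the hex digest; this
    is an illustration, not the exact naming scheme MediaWiki uses.
    """
    expected = os.path.splitext(os.path.basename(path))[0].lower()
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest() == expected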