On 8/10/07, Minute Electron <minuteelectron(a)googlemail.com> wrote:
A tool has been created called Wikix, it downloads all images from a wiki.
I'm not aware of any method to check the validity of non-deleted*
files after downloading them via HTTP, beyond checking the size and
hoping the file isn't corrupted, or downloading each file more than
once and comparing the copies.
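
For illustration, the size check amounts to roughly the following
Python sketch (a single plain HTTP GET using the standard library; the
function name and paths are mine, not from Wikix or any other tool):

    import urllib.request

    def fetch_and_check_size(url, dest_path):
        # Fetch the file and remember the length the server claimed.
        with urllib.request.urlopen(url) as resp:
            expected = resp.headers.get("Content-Length")
            data = resp.read()
        with open(dest_path, "wb") as out:
            out.write(data)
        # A matching size only proves the transfer wasn't truncated;
        # it says nothing about corruption that preserves the length.
        return expected is None or int(expected) == len(data)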
When you're talking about nearly 1TB of images fetched via 1.75
million HTTP requests (the current size of commons), corruption is a
real issue if you care about getting a good copy. Errors that leave
the size intact are quite possible, and fetching every file twice
isn't really a sane option for that much data.
I'm not aware of any efforts to download all of commons via HTTP.
Previously Jeff Merkey downloaded the images that English Wikipedia
uses, but that's only part of a much larger collection.
I don't believe that moving that much data is really a major issue in
itself, at least for the sort of people who have the storage around to
handle it. Back when I downloaded the old commons image dump (about
300GB) that we had posted, the transfer took four days, which I don't
consider a big deal at all.
*deleted files are renamed to the SHA1 of their content, so it's easy
to check their transfer validity. I wish non-deleted images behaved in
the same way.
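
To sketch why that naming helps (assuming the filename is just the hex
SHA1 of the file's bytes, which may not match the actual on-disk
layout):

    import hashlib, os

    def looks_intact(path):
        # Recompute the SHA1 of what was actually received and compare
        # it to the name the file was served under.
        expected = os.path.splitext(os.path.basename(path))[0].lower()
        h = hashlib.sha1()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest() == expected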