On Fri, Jan 8, 2010 at 2:37 PM, Gregory Maxwell gmaxwell@gmail.com wrote:
> Er. I've maintained a non-WMF disaster recovery archive for a long time, though it's no longer completely current since the rsync went away and web fetching is lossy.
> It saved our rear a number of times, rescuing thousands of images from irreparable loss.
While I certainly can't fault your good will, I do find it disturbing that it was necessary. Ideally, Wikimedia should have internal backups of sufficient quality that, for any circumstance short of meteors falling from the heavens, we don't have to depend on what third parties happen to have saved.
> Moreover it allowed things like image hashing before we had that in the database, and it would allow perceptual lossy hash matching if I ever got around to implementing tools to access the output.
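For concreteness, that kind of perceptual matching doesn't need much machinery. Here is a minimal average-hash sketch in Python; the Pillow library, the function names, and the match threshold are illustrative assumptions on my part, not a description of your actual tooling:

from PIL import Image  # assumes the Pillow imaging library is installed

def average_hash(path, hash_size=8):
    # Shrink to hash_size x hash_size grayscale and emit one bit per
    # pixel: 1 if brighter than the mean, else 0. Mild re-encoding or
    # rescaling of the source image leaves most bits unchanged.
    img = Image.open(path).convert("L").resize(
        (hash_size, hash_size), Image.LANCZOS)
    pixels = list(img.getdata())
    mean = sum(pixels) / len(pixels)
    return sum(1 << i for i, p in enumerate(pixels) if p > mean)

def hamming(h1, h2):
    # Count of differing bits; small distances suggest near-duplicates.
    return bin(h1 ^ h2).count("1")

# e.g. hamming(average_hash("a.jpg"), average_hash("b.jpg")) < 5
# usually means the same picture survived a lossy re-encode.

Nothing fancy; the point is that tools like this are cheap to run when they can live next to the image store, which is part of why I'd rather see them hosted within Wikimedia.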
If the goal is some version of "do something useful for Wikimedia", then it actually seems rather bizarre to have the first step be "copy X TB of gradually changing data to privately owned and managed servers". For Wikimedia applications, it would seem much more natural to make tools and technology available to do such things within Wikimedia. That way developers could work on such problems without having to worry about how much disk space they can personally afford. Again, there is nothing wrong with you generously doing such things with your own resources, but ideally running duplicate repositories for the benefit of Wikimedia should be unnecessary.
> There really are use cases. Moreover, making complete copies of the public data available as dumps to the public is a WMF board-supported initiative.
I agree with the goal of making WMF content available, but given that we don't offer any image dump right now, and that a comprehensive dump would be usable by almost no one, I don't think a classic dump is where we should start. Even you don't seem to want that. If I understand correctly, you'd like an easier way to reliably download individual image files; you wouldn't actually want to be presented with some monolithic multi-terabyte tarball each month.
Hence, I would say it makes more sense to discuss ways to make individual images and user-specified subsets of images more easily available. The same gateways that would allow you to keep synchronized could also help other people download individual files. Other goals could see functions like export pages expanded to include an option to download all associated image files at the same time one downloads a set of wikitext.
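As a rough sketch of what the client side of such a gateway might look like, the existing api.php can already enumerate the files used on a page and return their original URLs. The endpoint choice and the lack of continuation handling below are simplifications for illustration:

import json
import urllib.parse
import urllib.request

API = "https://commons.wikimedia.org/w/api.php"  # illustrative endpoint

def image_urls(page_title):
    # Ask api.php for every file used on the page (generator=images)
    # and the URL of each original (prop=imageinfo, iiprop=url).
    # Continuation for very long file lists is omitted for brevity.
    params = urllib.parse.urlencode({
        "action": "query",
        "format": "json",
        "generator": "images",
        "gimlimit": "max",
        "titles": page_title,
        "prop": "imageinfo",
        "iiprop": "url",
    })
    with urllib.request.urlopen(API + "?" + params) as resp:
        data = json.load(resp)
    for page in data.get("query", {}).get("pages", {}).values():
        for info in page.get("imageinfo", []):
            yield info["url"]

# e.g. for url in image_urls("Example"): fetch and store url

The same query shape would extend to user-specified subsets, e.g. feeding in a list of titles or a category, which is most of what an image-aware export function would need.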
The general point I am trying to make is that if we think about what people really want, and how the files are likely to be used, then there may be better delivery approaches than trying to create huge image dumps.
-Robert Rohde