On Fri, Jan 8, 2010 at 8:25 PM, Robert Rohde rarohde@gmail.com wrote:
While I certainly can't fault your good will, I do find it disturbing that it was necessary. Ideally, Wikimedia should have internal backups of sufficient quality that we don't have to depend on what third parties happen to have saved for any circumstance short of meteors falling from the heavens.
Yea, well, you can't easily eliminate all the internal points of failure. "someone with root loses control of their access and someone nasty wipes everything" is really hard to protect against with online systems.
Avoiding the case where some failure is reliably replicated among all of WMF's copies (which is what happened with the deletions I recovered: there were redundant copies, and they were deleted too) is best accomplished with an air gap.
And meteors *do* fall, if rarely. WMF can be robust against that for only the price of making all the data available, which is something worth doing for other principled and practical reasons.
Within Wikimedia means that Wikimedia remains a single point of failure. This is too easy to avoid. Disk space is cheap, and not your problem. At least a few third parties will create and maintain full copies, and this is a good thing.
Moreover it allowed things like image hashing before we had that in the database, and it would allow perceptual lossy hash matching if I ever got around to implementing tools to access the output.
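To make the hashing point concrete, here is a rough sketch of the two kinds of matching a local copy allows: exact-content hashing to find byte-identical files, and a toy "average hash" as one simple form of perceptual hashing. Everything in it is an assumption for illustration, not the tooling actually used: the function names are hypothetical, the mirror is assumed to be a plain directory tree, and the perceptual hash assumes the image has already been downscaled to a small grayscale grid by some other tool.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def sha1_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-1 digest of a file, reading it in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_exact_duplicates(root):
    """Group files under `root` by content hash; keep groups with >1 file."""
    groups = defaultdict(list)
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            groups[sha1_of_file(path)].append(path)
    return {h: paths for h, paths in groups.items() if len(paths) > 1}


def average_hash(pixels):
    """Toy perceptual hash: one bit per pixel, set if brighter than the mean.

    `pixels` is a small grayscale image given as a list of rows of ints
    (hypothetical input format; downscaling is assumed to happen elsewhere).
    """
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for value in flat:
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def hamming_distance(a, b):
    """Number of bit positions in which two hashes differ."""
    return bin(a ^ b).count("1")
```

Exact hashes only catch identical bytes; the perceptual variant is meant to give nearby hashes (small Hamming distance) for images that look alike after recompression or minor edits, which is why it needs the pixel data and not just the file listing.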
If the goal is some version of "do something useful for Wikimedia", then it actually seems rather bizarre to have the first step be "copy X TB of gradually changing data to privately owned and managed servers". For Wikimedia applications, it would seem much more natural to make tools and technology available to do such things within Wikimedia. That way developers could work on such problems without having to worry about how much disk space they can personally afford. Again, there is nothing wrong with you generously doing such things with your own resources, but ideally running duplicate repositories for the benefit of Wikimedia should be unnecessary.
Within Wikimedia means within Wikimedia's means, priorities, and politics. Having it locally means that if I decide I want to saturate a dozen cores computing perceptual hashes for a week, I don't have to convince anyone else that it's a good use of resources. I don't have to convince Wikimedia to fund a project, I don't have to take up resources which might be better used by someone else, and I don't have to set any expectations that I might not live up to.
Of course, it's great to have public resources 'locally' (which is what the toolserver is for), but that doesn't cover all cases.
There really are use cases. Moreover, making complete copies of the public data available to the public as dumps is a WMF board-supported initiative.
I agree with the goal of making WMF content available, but given that we don't offer any image dump right now and a comprehensive dump as such would be usable to almost no one, then I don't think a classic dump is where we should start. Even you don't seem to want that. If I understand correctly, you'd like to have an easier way to reliably download individual image files. You wouldn't actually want to be presented with some form of monolithic multi-terabyte tarball each month.
No one wants the monolithic tarball. The way I got updates previously was via an rsync push.
No one sane would suggest a monolithic tarball: it's too much of a pain to produce!
Image dump != monolithic tarball.
But I think producing subsets is pretty much worthless. I can't think of a valid use for any reasonably sized subset. ("All media used on big wiki X" is a useful subset I've produced for people before, but it's not small enough to be a big win vs. a full copy.)
[snip]
The general point I am trying to make is that if we think about what people really want, and how the files are likely to be used, then there may be better delivery approaches than trying to create huge image dumps.
If all is made available then everyone's wants can be satisfied. No subset is going to get us there. Of course, there are a lot of possibilities for the means of transmission, but I think it would be most useful to assume that at least a few people are going to want to grab everything.