On Mon, Aug 15, 2011 at 3:14 PM, Russell N. Nelson - rnnelson <rnnelson@clarkson.edu> wrote:
Anyway, you can't rely on the media files being stored in a filesystem. They could be stored in a database or in an object store. So *sync is not available.
Note that at this moment, upload file storage on the Wikimedia sites is still 'some big directories on an NFS server', but at some point it is planned to migrate to a Swift cluster backend: http://www.mediawiki.org/wiki/Extension:SwiftMedia
A file dump / bulk-fetching intermediary would probably need to speak to MediaWiki to get lists of available files, and then go back through the storage backend to actually obtain them.
There's no reason why this couldn't speak something like the rsync protocol, of course.
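For the "get lists of available files" half, here's a rough sketch of what that could look like against the public web API (list=allimages). Commons is just an example endpoint here, and this is not an existing tool, only an illustration:

import requests

API = "https://commons.wikimedia.org/w/api.php"  # example endpoint; any MediaWiki API would do

def iter_all_images(session):
    """Yield (name, url, size, sha1) for every file the wiki knows about."""
    params = {
        "action": "query",
        "list": "allimages",
        "aiprop": "url|size|sha1",
        "ailimit": "500",
        "format": "json",
    }
    while True:
        data = session.get(API, params=params).json()
        for img in data["query"]["allimages"]:
            yield img["name"], img["url"], img["size"], img["sha1"]
        if "continue" not in data:
            break
        params.update(data["continue"])  # follow the API's continuation to the next batch

with requests.Session() as s:
    for name, url, size, sha1 in iter_all_images(s):
        print(name, size, url)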
I don't know how the media files are backed up. If you only want the originals, that's a lot less than 12TB (or whatever the current number is for thumbs+origs). If you just want to fetch a tarball, wget or curl will automatically restart a connection and supply a range parameter if the server supports it. If you want a ready-to-use format, then you're going to need a client which can write individual files. But it's not particularly efficient to stream 120B files over a separate TCP connection. You'd have to have a client which can do TCP session reuse. No matter how you cut it, you're looking at a custom client. But there's no need to invent a new download protocol or stream format. That's why I suggest tarball and range. Standards ... they're not just for breakfast.
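For what it's worth, that custom client doesn't have to be fancy. A rough sketch of the two pieces just mentioned -- resuming with an HTTP Range header, and reusing one connection for many requests -- with a made-up dump URL:

import os
import requests

def resume_download(session, url, dest):
    """Fetch url into dest, continuing from however many bytes we already have."""
    have = os.path.getsize(dest) if os.path.exists(dest) else 0
    headers = {"Range": f"bytes={have}-"} if have else {}
    with session.get(url, headers=headers, stream=True) as r:
        r.raise_for_status()
        # 206 = the server honoured the Range header, so append; otherwise start over.
        mode = "ab" if r.status_code == 206 else "wb"
        with open(dest, mode) as f:
            for chunk in r.iter_content(1 << 20):
                f.write(chunk)

# One Session means one pooled TCP connection per host, reused across requests --
# the "session reuse" part when you're pulling lots of small files.
with requests.Session() as s:
    resume_download(s, "https://dumps.example.org/media-originals.tar", "media-originals.tar")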
Range on a tarball assumes that you have a static tarball file -- or else a predictable, unchanging snapshot of its contents that can be used to simulate one:
1) every filename in the data set, in order
2) every file's exact size and version
3) every other bit of file metadata that might go into constructing that tarball
or else actually generating and storing a giant tarball, and then keeping it around long enough for all clients to download the whole thing -- obviously not very attractive.
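To make the "simulated tarball" option concrete, here's a toy illustration (not an existing server) of what answering a Range request against a tarball you never actually store requires: computing every member's byte position from a frozen, ordered manifest of names and exact sizes.

BLOCK = 512  # tar works in 512-byte blocks: one header block, then data padded to a block boundary

def tar_offsets(manifest):
    """manifest: ordered list of (name, size).
    Returns {name: (start, end)} byte ranges of each member's data
    within the virtual tarball (trailing end-of-archive blocks ignored)."""
    offsets = {}
    pos = 0
    for name, size in manifest:
        pos += BLOCK                      # the member's header block
        offsets[name] = (pos, pos + size)
        pos += -(-size // BLOCK) * BLOCK  # the data, rounded up to whole blocks
    return offsets

# Example: {'a.jpg': (512, 1512), 'b.png': (2048, 4096)}
print(tar_offsets([("a.jpg", 1000), ("b.png", 2048)]))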
Since every tiny change (*any* new file, *any* changed file, *any* deleted file) would alter the generated tarball and shift terabytes of data around, this doesn't seem like it would be a big win for anything other than initial downloads of the full data set (or else batching up specifically-requested files).
Anything that involves updating your mirror/copy/fork/backup needs to work in a more live fashion, one that only transfers new data for things that have changed. rsync can check for differences but still needs to go over the full file list (and so still takes a Long Time and lots of bandwidth just to do that).
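The API side of that can stay cheap, though: the same allimages query, sorted by upload timestamp and started at the time of your last sync, returns only what's new instead of the full file list. Another rough sketch (example endpoint and timestamp; note it only catches new uploads and re-uploads -- deletions would need a separate check against the deletion log):

import requests

API = "https://commons.wikimedia.org/w/api.php"  # example endpoint

def uploaded_since(session, timestamp):
    """Yield (name, url, upload time) for files uploaded at or after `timestamp` (ISO 8601)."""
    params = {
        "action": "query",
        "list": "allimages",
        "aisort": "timestamp",   # order by upload time instead of by name
        "aistart": timestamp,    # only files uploaded at or after this point
        "aiprop": "url|timestamp",
        "ailimit": "500",
        "format": "json",
    }
    while True:
        data = session.get(API, params=params).json()
        for img in data["query"]["allimages"]:
            yield img["name"], img["url"], img["timestamp"]
        if "continue" not in data:
            break
        params.update(data["continue"])  # continuation, same as before

with requests.Session() as s:
    for name, url, ts in uploaded_since(s, "2011-08-01T00:00:00Z"):
        print(ts, name)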
-- brion