On Mon, Aug 15, 2011 at 3:14 PM, Russell N. Nelson - rnnelson
<rnnelson@clarkson.edu> wrote:
> Anyway, you can't rely on the media files being stored in a filesystem.
> They could be stored in a database or an object store, so *sync is not
> available.
Note that at the moment, upload file storage on the Wikimedia sites is still
'some big directories on an NFS server', but a migration to a Swift cluster
backend is planned:
http://www.mediawiki.org/wiki/Extension:SwiftMedia
A file dump / bulk-fetching intermediary would probably need to talk to
MediaWiki to get lists of available files, and then go through the storage
backend to actually obtain them.
There's no reason why this couldn't speak something like the rsync protocol,
of course.
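For the file-list half of that, the web API can already enumerate every
upload. A rough Python sketch (untested; list=allimages and its parameters
are the standard MediaWiki API ones, but treat the endpoint and the
'continue' convention here as assumptions):

    import requests

    API = "https://commons.wikimedia.org/w/api.php"  # example wiki endpoint

    def iter_file_list(batch=500):
        """Yield (name, url, size) for every original file via list=allimages."""
        params = {
            "action": "query",
            "list": "allimages",
            "aiprop": "url|size|sha1",
            "ailimit": batch,
            "format": "json",
        }
        while True:
            r = requests.get(API, params=params, timeout=30)
            r.raise_for_status()
            data = r.json()
            for img in data["query"]["allimages"]:
                yield img["name"], img["url"], img["size"]
            if "continue" not in data:       # no more batches
                break
            params.update(data["continue"])  # resume where the last batch ended

An intermediary could walk that list and then pull the actual bytes straight
from whatever backend holds them.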
> I don't know how the media files are backed up. If you only want the
> originals, that's a lot less than 12TB (or whatever the current number is
> for thumbs+origs). If you just want to fetch a tarball, wget or curl will
> automatically restart a connection and supply a range parameter if the
> server supports it. If you want a ready-to-use format, then you're going to
> need a client which can write individual files. But it's not particularly
> efficient to stream 120B files over a separate TCP connection. You'd have
> to have a client which can do TCP session reuse. No matter how you cut it,
> you're looking at a custom client. But there's no need to invent a new
> download protocol or stream format. That's why I suggest tarball and range.
> Standards ... they're not just for breakfast.
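(Both halves of that are indeed cheap with stock HTTP clients: a Range
header resumes where you left off, and a single keep-alive session reuses
one TCP connection across many fetches. A minimal Python sketch, with a
hypothetical dump URL:)

    import os
    import requests

    URL = "https://example.org/media-dump.tar"  # hypothetical tarball location

    def resume_download(url, path, chunk=1 << 20):
        """Fetch url into path, resuming any partial file via HTTP Range."""
        have = os.path.getsize(path) if os.path.exists(path) else 0
        with requests.Session() as s:  # one Session = one reusable connection
            r = s.get(url, headers={"Range": "bytes=%d-" % have},
                      stream=True, timeout=30)
            r.raise_for_status()
            mode = "ab" if r.status_code == 206 else "wb"  # 206 = partial content
            with open(path, mode) as f:
                for block in r.iter_content(chunk):
                    f.write(block)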
Range on a tarball assumes that you have a static tarball file -- or else a
predictable, unchanging snapshot of its contents that can be used to
simulate one:
1) every filename in the data set, in order
2) every file's exact size and version
3) every other bit of file metadata that might go into constructing that
tarball
-- or else actually generating and storing a giant tarball, then keeping it
around long enough for every client to download the whole thing. Obviously
not very attractive.
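To make concrete what that frozen manifest buys you: in the ustar format
each entry is a 512-byte header followed by the file data padded to a
512-byte boundary, so every byte offset is a pure function of the ordered
(name, size) list. A sketch, assuming such a manifest exists:

    def tar_entry_offsets(manifest):
        """Map each (name, size) in a frozen, ordered manifest to its
        byte offset in the virtual tarball."""
        offsets, pos = {}, 0
        for name, size in manifest:
            offsets[name] = pos
            pos += 512                        # ustar header block
            pos += (size + 511) // 512 * 512  # data, zero-padded to 512 bytes
        return offsets

Add, remove, or resize one file and every offset after it moves -- which is
exactly the next problem: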
Since every tiny change (*any* new file, *any* changed file, *any* deleted
file) would alter the generated tarball and shift terabytes of data around,
this doesn't seem like a big win for anything other than initial downloads
of the full data set (or batching up specifically requested files).
Anything that involves updating your mirror/copy/fork/backup needs to work
in a more live fashion, transferring data only for things that have changed.
rsync can check for differences, but it still has to walk the full file list
(and so still takes a Long Time and lots of bandwidth just to do that).
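An incremental pull keyed on upload timestamp avoids rescanning everything
for additions, though not for deletions. A sketch, assuming list=allimages
supports timestamp sorting (aisort/aistart) as the standard API documents:

    import requests

    def iter_uploaded_since(api, since):
        """Yield (name, url) for files uploaded since an ISO 8601 timestamp.
        Deletions never show up here -- catching those still means comparing
        full file lists, which is the expensive part."""
        params = {
            "action": "query",
            "list": "allimages",
            "aisort": "timestamp",
            "aistart": since,  # e.g. "2011-08-01T00:00:00Z"
            "aiprop": "url|timestamp",
            "ailimit": 500,
            "format": "json",
        }
        with requests.Session() as s:
            while True:
                data = s.get(api, params=params, timeout=30).json()
                for img in data["query"]["allimages"]:
                    yield img["name"], img["url"]
                if "continue" not in data:
                    break
                params.update(data["continue"])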
-- brion