Exactly what I propose. Keep a list of files and their sizes, so that when somebody asks for a range, you can skip files up until you get to the range they've requested. Not worrying about new or already-downloaded changed files, or deleted files. You're not getting a "current" copy of the files, you're getting a copy of the files that were available when you started your download. Minus the deleted files, which by policy we shouldn't be handing out anyway.
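The skip-by-size idea could be sketched roughly like this — a hypothetical manifest of (name, size) pairs frozen when the download began, used to map a requested byte offset to a position in the virtual tarball (entry-length math follows the ustar convention of a 512-byte header plus data padded to a 512-byte boundary):

```python
# Sketch only: locate a byte offset inside a virtual tarball, given a
# manifest of (filename, size) pairs captured at download start.
BLOCK = 512

def entry_length(size):
    """Bytes a file occupies in the tar stream: header + padded data."""
    return BLOCK + ((size + BLOCK - 1) // BLOCK) * BLOCK

def locate(manifest, start):
    """Skip entries until `start` falls inside one.

    Returns (index of entry, offset within that entry)."""
    pos = 0
    for i, (name, size) in enumerate(manifest):
        length = entry_length(size)
        if start < pos + length:
            return i, start - pos
        pos += length
    raise ValueError("offset past end of archive")

# Hypothetical manifest; the first entry occupies 512 + 1024 = 1536 bytes,
# so offset 2000 lands 464 bytes into the second entry.
manifest = [("a.jpg", 1000), ("b.png", 4000), ("c.ogg", 200)]
print(locate(manifest, 2000))  # -> (1, 464)
```

Because the manifest is immutable for the lifetime of a given download, every range request resolves against the same snapshot, regardless of what changes on the live site in the meantime.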
rsync doesn't have the MW database to consult for changes.
________________________________________
From: wikitech-l-bounces@lists.wikimedia.org [wikitech-l-bounces@lists.wikimedia.org] on behalf of Brion Vibber [brion@pobox.com]
Sent: Monday, August 15, 2011 6:31 PM
To: Wikimedia developers
Subject: Re: [Wikitech-l] forking media files
On Mon, Aug 15, 2011 at 3:14 PM, Russell N. Nelson - rnnelson < rnnelson@clarkson.edu> wrote:
download protocol or stream format. That's why I suggest tarball and range. Standards ... they're not just for breakfast.
Range on a tarball assumes that you have a static tarball file -- or else a predictable, unchanging snapshot of its contents that can be used to simulate one:
1) every filename in the data set, in order
2) every file's exact size and version
3) every other bit of file metadata that might go into constructing that tarball
or else actually generating and storing a giant tarball, and then keeping it around long enough for all clients to download the whole thing -- obviously not very attractive.
Since every tiny change (*any* new file, *any* changed file, *any* deleted file) would alter the generated tarball and shift terabytes of data around, this doesn't seem like it would be a big win for anything other than initial downloads of the full data set (or else batching up specifically-requested files).
Anything that involves updating your mirror/copy/fork/backup needs to work in a more live fashion, that only needs to transfer new data for things that have changed. rsync can check for differences but still needs to go over the full file list (and so still takes a Long Time and lots of bandwidth just to do that).
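A minimal sketch of that incremental approach, under the assumption (hypothetical here) that both sides can produce a manifest mapping filename to version: diff the manifests and transfer only what is new or changed, rather than walking every file the way rsync must:

```python
def plan_update(old, new):
    """Diff two {filename: version} manifests.

    Returns (files to fetch, files to delete locally). Only entries that
    are new or whose version changed need any data transfer at all."""
    fetch = [name for name, ver in new.items() if old.get(name) != ver]
    delete = [name for name in old if name not in new]
    return fetch, delete

# Hypothetical manifests: a.jpg changed, b.png was deleted, c.ogg is new.
old = {"a.jpg": "v1", "b.png": "v1"}
new = {"a.jpg": "v2", "c.ogg": "v1"}
print(plan_update(old, new))  # -> (['a.jpg', 'c.ogg'], ['b.png'])
```

The win over rsync is that the MW database already knows which files changed, so the server can hand out a small delta manifest instead of forcing every mirror to re-scan terabytes of file listings.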
-- brion

_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l