On Mon, Aug 15, 2011 at 4:30 PM, Russell N. Nelson - rnnelson <rnnelson@clarkson.edu> wrote:
Wasn't suggesting that GNU tar or PDtar or any suchlike would be usable. I'm pretty sure that whatever protocol is used, you probably can't have a standard client or server simply because of the size of the data to be transferred. Maybe rsync would work with a custom rsyncd? Not so familiar with that protocol. Doesn't it compute an md5 for all files and ship it around?
rsync's wire protocol isn't very well documented, but roughly speaking, it builds a "file list" of every file that may need to be transferred, and the server and client compare notes to see which ones will actually need to get transferred (and then which pieces of the files need to be transferred, if they exist in both places).
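To illustrate the block-matching idea, here's a toy sketch in Python -- emphatically not rsync's actual wire format (real rsync uses a rolling weak checksum plus a stronger hash so it can match blocks at arbitrary offsets, not just on block boundaries):

# Toy sketch of rsync-style block matching (illustration only).
import hashlib

BLOCK = 4096  # hypothetical block size

def signature(basis: bytes):
    """Receiver: hash each fixed-size block of the file it already has."""
    return {
        hashlib.md5(basis[i:i + BLOCK]).hexdigest(): i
        for i in range(0, len(basis), BLOCK)
    }

def delta(new: bytes, sig: dict):
    """Sender: emit ('copy', offset) for blocks the receiver already has,
    ('data', chunk) for everything else."""
    ops = []
    for i in range(0, len(new), BLOCK):
        chunk = new[i:i + BLOCK]
        h = hashlib.md5(chunk).hexdigest()
        ops.append(("copy", sig[h]) if h in sig else ("data", chunk))
    return ops

def patch(basis: bytes, ops):
    """Receiver: rebuild the new file from its old copy plus the delta."""
    out = bytearray()
    for kind, arg in ops:
        out += basis[arg:arg + BLOCK] if kind == "copy" else arg
    return bytes(out)

The point is just that the receiver only ever pulls the chunks it doesn't already have; everything else is reconstructed locally.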
Since rsync 3 the file list can be built and sent incrementally, which made it possible to do batch rsyncs of Wikimedia uploads to/from a couple of ad-hoc off-site servers (I think Greg Maxwell ran one for a while? I do not know whether any of these are still in place -- other people manage these servers now and I just haven't paid attention).
Older versions of rsync would build and transfer the entire file list first, which was impractical when a complete file list for millions and millions of files would be bigger than RAM and take hours just to generate. :)
A custom rsync daemon could certainly speak to regular rsync clients, handling the file listings and pulling up the appropriate backend files. Simply restarting the transfer handles pulling updates or resuming a broken transfer with no extra trouble.
For ideal incremental updates & recoveries you'd want to avoid transferring data about unchanged files at all -- rsync still has to send the full file list over so the other end can check which files need updating.
A more customized protocol might do better there; offhand I'm not sure whether rsync 3's protocol can be bent to that or whether something else would be needed.
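To be concrete about what I mean by "more customized": something like a manifest diff, where each side keeps a (path -> size, mtime) snapshot from the last run and only the changes cross the wire. A minimal sketch, all names hypothetical (Python 3.9+ for the builtin-generic alias):

# Hypothetical manifest-diff sketch: only changes since the last sync
# cross the wire, instead of a full file list every run.
Manifest = dict[str, tuple[int, float]]  # path -> (size, mtime)

def diff(old: Manifest, new: Manifest):
    """Compute what changed between two snapshots."""
    added   = [p for p in new if p not in old]
    removed = [p for p in old if p not in new]
    changed = [p for p in new if p in old and new[p] != old[p]]
    return added, removed, changed

The server would keep per-client cursors or dated manifests; the open question is whether rsync 3's batched features get you close enough without inventing any of this.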
(For the most part we don't need rsync's ability to transfer pieces of large individual files, though it's a win if a transfer gets interrupted on a large video file; usually we just want to find *new* files or files that need to be deleted. It may be possible to optimize this on the existing protocol with timestamp limitations.)
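For instance, something along these lines could feed rsync only the recently modified files (--files-from is a real rsync option; the paths and stamp-file convention here are just assumptions about how you'd track the last run):

# Walk the upload tree and list files modified since the last sync;
# the output can be fed to `rsync --files-from=-`.
# (Paths and the cutoff mechanism are hypothetical.)
import os, sys, time

def changed_since(root: str, cutoff: float):
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            if os.path.getmtime(path) >= cutoff:
                # --files-from wants paths relative to the source root
                yield os.path.relpath(path, root)

if __name__ == "__main__":
    root, stamp = sys.argv[1], sys.argv[2]  # e.g. /uploads /var/run/last-sync
    now = time.time()  # snapshot before walking so nothing slips through the gap
    cutoff = os.path.getmtime(stamp) if os.path.exists(stamp) else 0.0
    for rel in changed_since(root, cutoff):
        print(rel)
    open(stamp, "a").close()     # create the stamp file if it's missing
    os.utime(stamp, (now, now))  # remember where this run left off

You'd pipe that into something like "rsync -a --files-from=- /uploads remote::uploads". Deletions wouldn't show up this way, of course, so that side of it still needs a real listing or a manifest.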
-- brion