On Mon, Aug 15, 2011 at 4:30 PM, Russell N. Nelson - rnnelson <
rnnelson(a)clarkson.edu> wrote:
Wasn't suggesting that GNU tar or PDtar or any
suchlike would be usable.
I'm pretty sure that whatever protocol is used, you probably can't have a
standard client or server simply because of the size of the data to be
transferred. Maybe rsync would work with a custom rsyncd? Not so familiar
with that protocol. Doesn't it compute an md5 for all files and ship it
around?
rsync's wire protocol isn't very well documented, but roughly speaking, it
builds a "file list" of every file that may need to be transferred, and the
server and client compare notes to see which ones will actually need to get
transferred (and then which pieces of the files need to be transferred, if
they exist in both places).
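To make that comparison concrete, here's a rough Python sketch of the "quick check" rsync does by default: a file is skipped only if it exists on both sides with the same size and mtime. (Function names and the dict representation are mine, not rsync's; the real wire protocol streams this data rather than building dicts.)

```python
import os

def build_file_list(root):
    """Walk a tree and record (relative path -> (size, mtime)) for every
    file -- roughly the "file list" rsync's sender builds."""
    entries = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            rel = os.path.relpath(full, root)
            st = os.stat(full)
            entries[rel] = (st.st_size, int(st.st_mtime))
    return entries

def files_needing_transfer(sender, receiver):
    """rsync-style quick check: transfer anything missing on the
    receiver or whose (size, mtime) pair differs."""
    return sorted(
        path for path, meta in sender.items()
        if receiver.get(path) != meta
    )
```

Only the files this returns go on to the block-checksum stage that decides which *pieces* need to move.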
Since rsync 3 the file list can be built and sent incrementally, which made
it possible to do batch rsyncs of Wikimedia uploads to/from a couple of
ad-hoc off-site servers (I think Greg Maxwell ran one for a while? I do not
know whether any of these are still in place -- other people manage these
servers now and I just haven't paid attention).
Older versions of rsync would build and transfer the entire file list first,
which was impractical when a complete file list for millions and millions of
files would be bigger than RAM and take hours just to generate. :)
A custom rsync daemon could certainly speak to regular rsync clients to
manage doing the file listings and pulling up the appropriate backend file.
Simply restarting the transfer handles pulling updates or resuming a broken
transfer with no additional trouble.
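For reference, even a stock daemon only needs a small module definition to serve a tree read-only to regular rsync clients; a custom backend would present the same interface. (Module name, paths, and hostname below are hypothetical.)

```
# /etc/rsyncd.conf -- sketch of a read-only module serving uploads
# (module name and path are made up for illustration)
[uploads]
    path = /srv/media/uploads
    comment = uploads mirror
    read only = yes
    list = yes
```

A client would then pull (or resume) with something like `rsync -av rsync://mirror.example.org/uploads/ ./uploads/`.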
For ideal incremental updates & recoveries you'd want to avoid having to
transfer data about unchanged files -- rsync will still have to send that
file list over so it can check if files need to be updated.
A more customized protocol might end up better there; offhand I'm not sure
whether rsync 3's protocol can be made convenient for that or whether
something else would be needed.
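The core of what such a custom protocol would ship is tiny compared to a full file list -- just the delta between two manifests. A sketch, assuming each side can keep (or reconstruct) a set of the paths it had at the last sync:

```python
def manifest_delta(old, new):
    """Compare two manifests (sets of file paths) and return what an
    incremental protocol actually needs to send: additions and
    deletions.  Unchanged files never appear on the wire at all."""
    added = sorted(new - old)
    deleted = sorted(old - new)
    return added, deleted
```

The hard part isn't this diff, of course; it's keeping the "old" manifest trustworthy on both ends.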
(For the most part we don't need rsync's ability to transfer pieces of large
individual files, though it's a win if a transfer gets interrupted on a
large video file; usually we just want to find *new* files or files that
need to be deleted. It may be possible to optimize this on the existing
protocol with timestamp limitations.)
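One way to do that timestamp optimization with the existing protocol: have the sender precompute a list of paths modified since the last sync and feed it to rsync's `--files-from` option, so unchanged files are never even listed. A hedged sketch (the helper is mine; note mtime alone can't detect deletions, so those still need a separate pass):

```python
import os

def files_changed_since(root, cutoff):
    """List paths under `root` whose mtime is at or after `cutoff`
    (a Unix timestamp) -- suitable input for rsync --files-from.
    Deletions are invisible to this check and need separate handling."""
    changed = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            if os.stat(full).st_mtime >= cutoff:
                changed.append(os.path.relpath(full, root))
    return sorted(changed)
```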
-- brion