Dear Jeremy,
Happy New Year, and thanks for your e-mail of 2012-12-31.
0) ADMINISTRATIVE
I fixed the capitalization of "Wikimedia" in both documentation and home page. I am now subscribed to this list and have read the last two years of postings.
1) SPDY
WP-MIRROR 0.5 and prior versions obtain image files from http://upload.wikimedia.org/. SPDY would reduce latency. WP-MIRROR 0.6 (not yet released) uses HTTP/1.1 persistent connections. WP-MIRROR 0.6 has built-in profiling, and the image downloading process now uses 64% less (wall clock) time. Therefore SPDY may not provide much advantage. Thanks also for informing me of the image tarballs.
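For the curious, the connection reuse amounts to something like the following untested sketch in Common Lisp; it assumes the Drakma HTTP client, which is not necessarily what WP-MIRROR actually uses:

;; Sketch only: fetch several image files over one HTTP/1.1 connection.
;; Drakma is an assumption here, not necessarily WP-MIRROR's real downloader.
(defun fetch-images-persistently (urls)
  (let ((stream nil))
    (dolist (url urls)
      (multiple-value-bind (body status headers uri new-stream must-close)
          (drakma:http-request url :close nil :stream stream :force-binary t)
        (declare (ignore headers uri))
        (when (eql status 200)
          ;; save under the file-name part of the URL
          (let ((name (subseq url (1+ (position #\/ url :from-end t)))))
            (with-open-file (out name :direction :output
                                      :element-type '(unsigned-byte 8)
                                      :if-exists :supersede)
              (write-sequence body out))))
        ;; keep the socket open for the next request unless the server
        ;; asked us to close it
        (setf stream (if must-close nil new-stream))))
    (when stream (close stream))))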
Conclusion: I will not pursue SPDY, for lack of a requirement. Action Item: WP-MIRROR 0.6 will make use of image tarballs.
2) METALINK
WP-MIRROR 0.5 and prior versions had to deal with thousands of corrupt image files. Most of these were partial downloads. cURL would time out and leave corrupt files. I currently deal with that by validating the images. Validation, however, consumes a lot of time. So I am looking for ways to improve the reliability of downloading.
Metalink was brought to my attention by Jason Skomorowski. Relevant documents are RFC 5854 and RFC 6249. From the latter we have:
"This document describes a mechanism by which the benefit of mirrors can be automatically and more effectively realized. All the information about a download, including mirrors, cryptographic hashes, digital signatures, and more can be transferred in coordinated HTTP header fields, hereafter referred to as a "Metalink". This Metalink transfers the knowledge of the download server (and mirror database) to the client. Clients can fall back to other mirrors if the current one has an issue. With this knowledge, the client is enabled to work its way to a successful download even under adverse circumstances. All this can be done without complicated user interaction, and the download can be much more reliable and efficient. In contrast, a traditional HTTP redirect to a mirror conveys only minimal information -- one link to one server -- and there is no provision in the HTTP protocol to handle failures. Furthermore, in order to provide better load distribution across servers and potentially faster downloads to users, Metalink/HTTP facilitates multi-source downloads, where portions of a file are downloaded from multiple mirrors (and, optionally, Peer-to-Peer) simultaneously. Upon connection to a Metalink/HTTP server, a client will receive information about other sources of the same resource and a cryptographic hash of the whole resource. The client will then be able to request chunks of the file from the various sources, scheduling appropriately in order to maximize the download rate."
The benefit to WP-MIRROR would be much more reliable downloads, which would obviate the file validation process.
The benefits to folks on this e-mail list are: a) your mirror sites would get more traffic (Ariel mentioned that they are getting very little); b) the download process (for metalink-capable clients) would be robust against the outage of any one mirror; and c) metalink-capable clients are now common (cURL, kget, ...).
I understand that the idea for metalink originated with those who posted GNU/Linux distributions in .iso format. With each new .iso release, there would be a surge of downloading, causing many partial downloads (i.e. much wasted bandwidth). Metalink helped spread the load; and, by transporting hashes, improved download integrity.
Conclusion: I will table the issue of metalink, for lack of an immediate requirement. Action Item: WP-MIRROR 0.6 will incorporate your list of dump/tarball mirror sites as a configurable parameter.
3) RSYNC
Thanks for letting me know that dumps and tarballs are available using rsync. I much prefer rsync over http and ftp. I mirror the Debian archive, and recently switched from apt-mirror (which uses wget) to ftpsync (which uses rsync); I am very happy with the results.
Action Item: WP-MIRROR 0.6 will make use of rsync.
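For illustration, the rsync step might look something like the following untested sketch; the source path below is a placeholder, not a confirmed rsync module name, which should be taken from the mirror documentation:

;; Sketch only: pull the dump files for one wiki via rsync instead of HTTP.
;; The source path is a PLACEHOLDER, not a confirmed rsync module name.
(defun rsync-dumps (&optional (source "dumps.example.org::dumps/simplewiki/latest/")
                              (target "/var/lib/mediawiki/dumps/simplewiki/"))
  (uiop:run-program (list "rsync" "-av" "--partial" "--timeout=600" source target)
                    :output t
                    :error-output :output
                    :ignore-error-status t))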
Ariel raised some other points, which I shall address in a separate email.
Sincerely Yours, Kent
On Wed, Jan 2, 2013 at 4:30 PM, wp mirror wpmirrordev@gmail.com wrote:
Happy New Year, and thanks for your e-mail of 2012-12-31.
You too!
- ADMINISTRATIVE
I fixed the capitalization of "Wikimedia" in both documentation and home page. I am now subscribed to this list and have read the last two years of postings.
I think I saw it was already fixed a few days ago. Thanks.
- METALINK
[...]
I still need to read more about metalink but I think it squarely falls in the "we'll support it if someone does the work to make it happen" category. (In case that wasn't clear before, you're welcome to do that work yourself if you like.)
- SPDY
WP-MIRROR 0.5 and prior versions obtain image files from http://upload.wikimedia.org/. SPDY would reduce latency. WP-MIRROR 0.6 (not yet released) uses HTTP/1.1 persistent connections. WP-MIRROR 0.6 has built-in profiling, and the image downloading process now uses 64% less (wall clock) time. Therefore SPDY may not provide much advantage. Thanks also for informing me of the image tarballs.
Conclusion: I will not pursue SPDY, for lack of a requirement. Action Item: WP-MIRROR 0.6 will make use of image tarballs.
- RSYNC
Thanks for letting me know that dumps and tarballs are available using rsync. I much prefer rsync over http and ftp. I mirror the Debian archive, and recently switched from apt-mirror (which uses wget) to ftpsync (which uses rsync); I am very happy with the results.
Action Item: WP-MIRROR 0.6 will make use of rsync.
Are you in any of the Debian mirror rotations? (in DNS, or maybe it's an internal mirror?) If you can handle the load / storage then you could become a Wikimedia mirror as well. Actually I'm typing this message on a squeeze box right now. :)
btw, for image tarballs: I'm not sure offhand if they exist for all wikis? I think maybe some larger wikis don't have them. rsync is also available for the images individually. (1 file per file)
-Jeremy
On Wed, 02-01-2013, at 16:48 +0000, Jeremy Baron wrote:
btw, for image tarballs: I'm not sure offhand if they exist for all wikis? I think maybe some larger wikis don't have them. rsync is also available for the images individually. (1 file per file)
The only project that does not have a media tarball is commons; just use rsync for that. Every other project has two sets of tarballs: one with locally hosted media (uploaded to the project) and one with remotely hosted media (served from commons).
Ariel
Dear Jeremy,
On 1/2/13, Jeremy Baron jeremy@tuxmachine.com wrote:
Are you in any of the Debian mirror rotations? (in DNS. or maybe it's an internal mirror?) If you can handle the load / storage then you could become a Wikimedia mirror as well. Actually I'm typing this message on a squeeze box right now. :)
My Debian mirror is a leaf node, and is used internally. Previously, when I used apt-mirror, I downloaded from http://ftp.us.debian.org/debian/ which is a primary site (DNS round robin). When I switched to ftpsync, I decided to rsync with debian.gtisc.gatech.edu which is a secondary mirror.
I do not have sufficient storage or bandwidth to mirror much of the Wikimedia Foundation's collection (approaching 100T ?). I do mirror a few of the wikipedias (en, simple, xh, zu, and cho), and then only the latest pages and articles, and their image files. This is primarily for the purpose of developing WP-MIRROR and for internal use. Most of my dump/image download experiments are conducted behind a web caching proxy so as to avoid wasting the Foundation's bandwidth (and my own).
Sincerely Yours, Kent
On 02/01/13 17:30, wp mirror wrote:
- METALINK
WP-MIRROR 0.5 and prior versions had to deal with thousands of corrupt image files. Most of these were partial downloads. cURL would time out and leave corrupt files. I currently deal with that by validating the images. Validation, however, consumes a lot of time. So I am looking for ways to improve the reliability of downloading.
What do you mean by the “file validation process”?
You can check the download against the sha1 it should have (in most cases, there are a few files with broken hashes, or missing revisions...).
Dear Platonides,
Happy New Year. Thank you for your email.
Here is a sketch of the issues with image files:
1) IMAGE FILE NAME
Issue: The enwiki has many image files whose names contain shell metacharacters (ampersand, asterisk, backquote, percent, question mark, etc.). The problem may be illustrated by trying the following slightly hazardous examples:
(shell)$ curl -O http://star*.jpg
(shell)$ curl -O http://foo%60ls%60bar.jpg
Countermeasures: a) When WP-MIRROR 0.5 scrapes the dump for image file names, such file names are immediately dropped from further consideration; and b) after downloading is complete, I sequester any image file whose name contains a percent sign, because such file names cause `rebuildImages.php --missing' to fail.
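In outline, the filter amounts to something like this (sketch only; the set of rejected characters below is illustrative, not the exact list that WP-MIRROR uses):

;; Sketch only: reject image file names containing shell metacharacters.
;; The rejected-character set is illustrative, not WP-MIRROR's exact list.
(defparameter *unsafe-characters* "&*`%?$|;<>\"'\\")

(defun safe-image-name-p (name)
  "Return NIL when NAME contains a character we refuse to handle."
  (not (find-if (lambda (c) (find c *unsafe-characters*)) name)))

;; usage: (remove-if-not #'safe-image-name-p scraped-file-names)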
2) PARTIAL DOWNLOADS
Issue: HTTP is not very reliable. I see thousands of partially downloaded image files littering $wgScriptPath/images/[0-9a-f]/.
The ideal solution would be to use digests. Digests are fast. However, while dump file digests (md5sum) are posted, image file checksums are not. Nor have I seen any digest metadata transported in HTTP headers as would be the case with metalink. So, I run the following:
Countermeasures: WP-MIRROR 0.5 and prior validate all image files by executing:
(shell)$ gm identify -verbose <filename> 2>&1 | grep identify
For most files, this test produces no output. When there is output, it is always an error message. There is a collection of such messages in the WP-MIRROR 0.5 Reference Manual http://www.nongnu.org/wp-mirror/manual/wp-mirror-0.5.pdf. See in particular:
Appendix E.14 Experiments with Corrupt Images
Appendix E.19 Messages
Bad image files are sequestered to $wgScriptPath/images/bad-images/[0-9a-f]/ for later manual inspection. Running `gm identify -verbose' is the time-consuming step that I mentioned in a previous e-mail, and it is the one I would like to obviate if possible.
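In outline, the check looks something like this (sketch only; it simply mirrors the `gm identify -verbose ... | grep identify' test above, here shelling out via uiop:run-program):

;; Sketch only: a file counts as valid when `gm identify -verbose'
;; emits no output containing "identify".
(defun image-valid-p (path)
  (let ((output (uiop:run-program
                 (list "gm" "identify" "-verbose" (namestring path))
                 :output :string
                 :error-output :output
                 :ignore-error-status t)))
    (not (search "identify" output))))

;; invalid files could then be moved aside, e.g.:
;; (unless (image-valid-p file)
;;   (rename-file file (merge-pathnames "bad-images/" file)))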
3) INVALID FILES
Issue: Many downloaded image files, upon closer inspection, turn out to be error messages produced by a nearby web caching proxy. A copy of one such message is given in:
Appendix E.19.2 Filename Issues
Countermeasures: After downloading, all image files smaller than 1K are grepped for "302 Redirected". Bad image files are sequestered to $wgScriptPath/images/bad-images/[0-9a-f]/ for later manual inspection.
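The small-file check is simple enough to sketch (untested, pure Common Lisp):

;; Sketch only: flag tiny "image" files that are really cached proxy
;; error pages containing the text "302 Redirected".
(defun proxy-error-page-p (path &optional (size-limit 1024))
  (with-open-file (in path :element-type '(unsigned-byte 8)
                           :if-does-not-exist nil)
    (when (and in (< (file-length in) size-limit))
      (let ((bytes (make-array (file-length in)
                               :element-type '(unsigned-byte 8))))
        (read-sequence bytes in)
        (not (null (search (map 'vector #'char-code "302 Redirected")
                           bytes)))))))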
4) SHA1SUM
I read with interest your remarks below about SHA1. I am aware of the img_sha1 field in the image table. However, I am unable to reproduce the values that I find there. As a concrete example, consider the file 'Arc_en_ciel.png' which appears in simplewiki:
(shell)$ env printf %s Arc_en_ciel.png | openssl dgst -md5
(stdin)= 00135a44372c142bd509367a9f166733
So far, so good. The file is indeed stored under $wgScriptPath/images/0/00/.
(rootshell)# openssl dgst -sha1 0/00/Arc_en_ciel.png
SHA1(0/00/Arc_en_ciel.png)= fd67104be2338dea99e1211be8b6824d3b271c38
(shell)$ mysql --host=localhost --user=root --password
Password: ....
mysql> SELECT img_sha1,img_name FROM simplewiki.image WHERE img_name='Arc_en_ciel.png';
+---------------------------------+-----------------+
| img_sha1                        | img_name        |
+---------------------------------+-----------------+
| tllx8mwbr31uissi6a9jq86836d6vy0 | Arc_en_ciel.png |
+---------------------------------+-----------------+
1 row in set (0.00 sec)
Not hexadecimal. The data type is VARBINARY(32), so I try conversion:
mysql> select HEX(img_sha1),img_name from simplewiki.image where img_name='Arc_en_ciel.png';
+----------------------------------------------------------------+-----------------+
| HEX(img_sha1)                                                    | img_name        |
+----------------------------------------------------------------+-----------------+
| 746C6C78386D776272333175697373693661396A7138363833366436767930 | Arc_en_ciel.png |
+----------------------------------------------------------------+-----------------+
1 row in set (0.14 sec)
Still not a match.
Perhaps you could help me understand how these digests are computed.
Sincerely Yours, Kent
On 1/2/13, Platonides platonides@gmail.com wrote:
On 02/01/13 17:30, wp mirror wrote:
- METALINK
WP-MIRROR 0.5 and prior versions had to deal with thousands of corrupt image files. Most of these were partial downloads. cURL would time out and leave corrupt files. I currently deal with that by validating the images. Validation, however, consumes a lot of time. So I am looking for ways to improve the reliability of downloading.
What do you mean by the “file validation process”?
You can check the download against the sha1 it should have (in most cases, there are a few files with broken hashes, or missing revisions...).
On 03/01/13 00:02, wp mirror wrote:
Dear Platonides,
Happy New Year. Thank you for your email.
Here is a sketch of the issues with image files:
- IMAGE FILE NAME
Issue: The enwiki has many image files whose names contain shell metacharacters (ampersand, asterisk, backquote, percent, question mark, etc.). The problem may be illustrated by trying the following slightly hazardous examples:
(shell)$ curl -O http://star*.jpg
(shell)$ curl -O http://foo%60ls%60bar.jpg
Obviously, you should have been using:
$ curl -O 'http://star*.jpg'
$ curl -O 'http://foo%60ls%60bar.jpg'
If you simply pass the parameters without quoting to curl, well, that's a bad idea. Especially since you don't seem to be treating $ specially...
Countermeasures: a) When WP-MIRROR 0.5 scrapes the dump for image file names, such file names are immediately dropped from further consideration; and b) after downloading is complete, I sequester any image file whose name contains a percent sign, because such file names cause `rebuildImages.php --missing' to fail.
- PARTIAL DOWNLOADS
Issue: HTTP is not very reliable. I see thousands of partially downloaded image files littering $wgScriptPath/images/[0-9a-f]/.
The ideal solution would be to use digests. Digests are fast. However, while dump file digests (md5sum) are posted, image file checksums are not. Nor have I seen any digest metadata transported in HTTP headers as would be the case with metalink. So, I run the following:
(...) Digests are available in the db (see below), except a few errors as mentioned.
- INVALID FILES
Issue: Many downloaded image files, upon closer inspection, turn out to be error messages produced by a nearby web caching proxy. A copy of one such message is given in:
Appendix E.19.2 Filename Issues
Countermeasures: After downloading, all image files smaller than 1K are grepped for "302 Redirected". Bad image files are sequestered to $wgScriptPath/images/bad-images/[0-9a-f]/ for later manual inspection.
- SHA1SUM
I read with interest your remarks below about SHA1. I am aware of the img_sha1 field in the image table. However, I am unable to reproduce the values that I find there. As a concrete example, consider the file 'Arc_en_ciel.png' which appears in simplewiki:
(shell)$ env printf %s Arc_en_ciel.png | openssl dgst -md5
(stdin)= 00135a44372c142bd509367a9f166733
So far, so good. The file is indeed stored under $wgScriptPath/images/0/00/.
(rootshell)# openssl dgst -sha1 0/00/Arc_en_ciel.png
SHA1(0/00/Arc_en_ciel.png)= fd67104be2338dea99e1211be8b6824d3b271c38
(shell)$ mysql --host=localhost --user=root --password
Password: ....
mysql> SELECT img_sha1,img_name FROM simplewiki.image WHERE img_name='Arc_en_ciel.png';
+---------------------------------+-----------------+
| img_sha1                        | img_name        |
+---------------------------------+-----------------+
| tllx8mwbr31uissi6a9jq86836d6vy0 | Arc_en_ciel.png |
+---------------------------------+-----------------+
1 row in set (0.00 sec)
Not hexadecimal. The data type is VARBINARY(32), so I try conversion:
mysql> select HEX(img_sha1),img_name from simplewiki.image where img_name='Arc_en_ciel.png';
+----------------------------------------------------------------+-----------------+
| HEX(img_sha1)                                                    | img_name        |
+----------------------------------------------------------------+-----------------+
| 746C6C78386D776272333175697373693661396A7138363833366436767930 | Arc_en_ciel.png |
+----------------------------------------------------------------+-----------------+
1 row in set (0.14 sec)
Still not a match.
Perhaps you could help me understand how these digests are computed.
Those are sha1 in base-36. You will need to convert from base-36 to base-16 to get the “classical output”.
Dear Platonides,
On 1/2/13, Platonides platonides@gmail.com wrote:
- IMAGE FILE NAME
-----snip-----
Obviously, you should have been using:
$ curl -O 'http://star*.jpg'
$ curl -O 'http://foo%60ls%60bar.jpg'
If you simply pass the parameters without quoting to curl, well, that's a bad idea. Especially since you don't seem to be treating $ specially...
Of course. I learned the quoting rules for /bin/sh, SQL, and many other systems. My point is really about risk tolerance. The image file `star*.jpg' is one real example of what was downloaded using an early version of WP-MIRROR, which I then rewrote to block. I am averse to file names that contain wildcards and other shell metacharacters. I can handle them safely *almost* all the time. But,
(shell)$ rm 'star*.jpg'   <-- one day I will forget to do this,
(shell)$ rm star*.jpg     <-- and will instead do this (with collateral damage).
Murphy's Law: Work two days straight, inadvertently delete three days' work, discover the backup tape is unreadable.
-----snip-----
- SHA1SUM
-----snip-----
(rootshell)# openssl dgst -sha1 0/00/Arc_en_ciel.png
SHA1(0/00/Arc_en_ciel.png)= fd67104be2338dea99e1211be8b6824d3b271c38
-----snip-----
mysql> SELECT img_sha1,img_name FROM simplewiki.image WHERE img_name='Arc_en_ciel.png';
+---------------------------------+-----------------+
| img_sha1                        | img_name        |
+---------------------------------+-----------------+
| tllx8mwbr31uissi6a9jq86836d6vy0 | Arc_en_ciel.png |
+---------------------------------+-----------------+
1 row in set (0.00 sec)
-----snip-----
Those are sha1 in base-36. You will need to convert from base-36 to base-16 to get the “classical output”.
I can't test this with the MySQL function CONV(), which is limited to 64 bits, so let's try:
(shell)$ clisp -q -q
[1]> (string-downcase (format nil "~36r" #xfd67104be2338dea99e1211be8b6824d3b271c38))
"tllx8mwbr31uissi6a9jq86836d6vy0"
It's a match. Excellent! Thank you very much.
Action Item: WP-MIRROR 0.6 shall use SHA1 digests to validate image files.
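Concretely, the check will be along these lines (untested sketch; it assumes the Ironclad library for SHA-1, which may not be what WP-MIRROR 0.6 ends up using):

;; Sketch only: compare a downloaded file against the image table's
;; img_sha1 column. Ironclad is an assumption here; MediaWiki stores
;; the digest in base-36, zero-padded to 31 characters.
(defun file-sha1-base36 (path)
  (let ((hex (ironclad:byte-array-to-hex-string
              (ironclad:digest-file :sha1 path))))
    (string-downcase
     (format nil "~36,31,'0r" (parse-integer hex :radix 16)))))

(defun image-matches-db-p (path img-sha1)
  (string= (file-sha1-base36 path) img-sha1))

;; e.g. (image-matches-db-p "0/00/Arc_en_ciel.png"
;;                          "tllx8mwbr31uissi6a9jq86836d6vy0")  => T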
Sincerely Yours, Kent