Dear Jeremy,
Happy New Year, and thanks for your e-mail of 2012-12-31.
0) ADMINISTRATIVE
I fixed the capitalization of "Wikimedia" in both documentation and home page. I am now subscribed to this list and have read the last two years of postings.
1) SPDY
WP-MIRROR 0.5 and prior versions obtain image files from http://upload.wikimedia.org/. SPDY would reduce latency. WP-MIRROR 0.6 (not yet released) uses HTTP/1.1 persistent connections. WP-MIRROR 0.6 also has built-in profiling, which shows that the image downloading process now uses 64% less (wall clock) time. Therefore SPDY may not provide much additional advantage. Thanks also for informing me of the image tarballs.
Conclusion: I will not pursue SPDY, for lack of a requirement. Action Item: WP-MIRROR 0.6 will make use of image tarballs.
2) METALINK
WP-MIRROR 0.5 and prior versions had to deal with thousands of corrupt image files. Most of these were partial downloads: cURL would time out and leave a corrupt file behind. I currently deal with that by validating the images. Validation, however, consumes a lot of time. So I am looking for ways to improve the reliability of downloading.
Metalink was brought to my attention by Jason Skomorowski. The relevant documents are RFC 5854 and RFC 6249. From the latter we have:
"This document describes a mechanism by which the benefit of mirrors can be automatically and more effectively realized. All the information about a download, including mirrors, cryptographic hashes, digital signatures, and more can be transferred in coordinated HTTP header fields, hereafter referred to as a "Metalink". This Metalink transfers the knowledge of the download server (and mirror database) to the client. Clients can fall back to other mirrors if the current one has an issue. With this knowledge, the client is enabled to work its way to a successful download even under adverse circumstances. All this can be done without complicated user interaction, and the download can be much more reliable and efficient. In contrast, a traditional HTTP redirect to a mirror conveys only minimal information -- one link to one server -- and there is no provision in the HTTP protocol to handle failures. Furthermore, in order to provide better load distribution across servers and potentially faster downloads to users, Metalink/HTTP facilitates multi-source downloads, where portions of a file are downloaded from multiple mirrors (and, optionally, Peer-to-Peer) simultaneously. Upon connection to a Metalink/HTTP server, a client will receive information about other sources of the same resource and a cryptographic hash of the whole resource. The client will then be able to request chunks of the file from the various sources, scheduling appropriately in order to maximize the download rate."
The benefit to WP-MIRROR would be much more reliable downloads, which would obviate the file validation process.
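To illustrate why a transported hash helps, here is a minimal sketch (not WP-MIRROR code) of checking a downloaded file against a known digest. The function name, the choice of SHA-256, and the chunk size are my own assumptions for the example; a partial download would simply fail the comparison.

```python
import hashlib

def file_matches_digest(path, expected_hex, algorithm="sha256", chunk_size=1 << 20):
    """Return True if the file at `path` hashes to `expected_hex`.

    Hypothetical helper: `expected_hex` stands in for a digest that the
    client obtained from the server alongside the download.
    """
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        # Read in chunks so large image files need not fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest() == expected_hex.lower()
```

A file that was truncated by a time-out hashes to a different value, so it can be discarded and fetched again without inspecting the image contents.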
The benefits to folks on this e-mail list are: a) your mirror sites would get more traffic (Ariel mentioned that they are getting very little); b) the download process (for metalink-capable clients) would be robust against the outage of any one mirror; and c) metalink-capable clients are now common (cURL, kget, ...).
I understand that the idea for metalink originated with those who post GNU/Linux distributions in .iso format. With each new .iso release, there would be a surge of downloading, causing many partial downloads (i.e. much wasted bandwidth). Metalink helped spread the load and, by transporting hashes, improved download integrity.
Conclusion: I will table the issue of metalink, for lack of an immediate requirement. Action Item: WP-MIRROR 0.6 will incorporate your list of dump/tarball mirror sites as a configurable parameter.
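As a rough sketch of what a configurable mirror list could buy (this is an illustration in Python, not how WP-MIRROR 0.6 is implemented), a client can try each mirror in order and fall back when one fails. The function names and the mirror URLs in the usage note are hypothetical.

```python
import urllib.request

def http_fetch(url, timeout=30):
    """Fetch a URL and return its body as bytes (simple default fetcher)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()

def fetch_with_fallback(mirrors, relative_path, fetch=http_fetch):
    """Try each mirror in order; return (mirror, data) from the first success.

    `mirrors` is the configurable list of base URLs; `fetch` is any callable
    that takes a URL and returns bytes, raising OSError on failure
    (urllib's URLError is a subclass of OSError).
    """
    errors = []
    for base in mirrors:
        url = base.rstrip("/") + "/" + relative_path.lstrip("/")
        try:
            return base, fetch(url)
        except OSError as exc:
            # Remember the failure and move on to the next mirror.
            errors.append((url, str(exc)))
    raise OSError(f"all {len(mirrors)} mirrors failed: {errors}")
```

With a (made-up) list such as ["http://mirror-a.example.org/dumps", "http://mirror-b.example.org/dumps"], an outage at the first site would only cost one failed request before the second is tried.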
3) RSYNC
Thanks for letting me know that dumps and tarballs are available using rsync. I much prefer rsync over http and ftp. I mirror the Debian archive, and recently switched from apt-mirror which uses wget, to ftpsync which uses rsync; and am very happy with the results.
Action Item: WP-MIRROR 0.6 will make use of rsync.
Ariel raised some other points which I shall address in a separate email.
Sincerely Yours, Kent