Dear Jeremy,
Happy New Year, and thanks for your e-mail of 2012-12-31.
0) ADMINISTRATIVE
I fixed the capitalization of "Wikimedia" in both documentation and home page. I am now subscribed to this list and have read the last two years of postings.
1) SPDY
WP-MIRROR 0.5 and prior versions obtain image files from http://upload.wikimedia.org/. SPDY would reduce latency. WP-MIRROR 0.6 (not yet released) uses HTTP/1.1 persistent connections. WP-MIRROR 0.6 has built-in profiling, and the image downloading process now uses 64% less (wall clock) time. Therefore SPDY may not provide much advantage. Thanks also for informing me of the image tarballs.
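For the curious, the connection reuse amounts to something like the following untested sketch in Common Lisp; it assumes the Drakma HTTP client, which is not necessarily what WP-MIRROR actually uses:

;; Sketch only: fetch several image files over one HTTP/1.1 connection.
;; Drakma is an assumption here, not necessarily WP-MIRROR's real downloader.
(defun fetch-images-persistently (urls)
  (let ((stream nil))
    (dolist (url urls)
      (multiple-value-bind (body status headers uri new-stream must-close)
          (drakma:http-request url :close nil :stream stream :force-binary t)
        (declare (ignore headers uri))
        (when (eql status 200)
          ;; save under the file-name part of the URL
          (let ((name (subseq url (1+ (position #\/ url :from-end t)))))
            (with-open-file (out name :direction :output
                                      :element-type '(unsigned-byte 8)
                                      :if-exists :supersede)
              (write-sequence body out))))
        ;; keep the socket open for the next request unless the server
        ;; asked us to close it
        (setf stream (if must-close nil new-stream))))
    (when stream (close stream))))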
Conclusion: I will not pursue SPDY, for lack of a requirement. Action Item: WP-MIRROR 0.6 will make use of image tarballs.
2) METALINK
WP-MIRROR 0.5 and prior versions had to deal with thousands of corrupt image files. Most of these were partial downloads. cURL would time out and leave corrupt files. I currently deal with that by validating the images. Validation, however, consumes a lot of time. So I am looking for ways to improve the reliability of downloading.
Metalink was brought to my attention by Jason Skomorowski. Relevant documents are RFC 5854 and RFC 6249. From the latter we have:
"This document describes a mechanism by which the benefit of mirrors can be automatically and more effectively realized. All the information about a download, including mirrors, cryptographic hashes, digital signatures, and more can be transferred in coordinated HTTP header fields, hereafter referred to as a "Metalink". This Metalink transfers the knowledge of the download server (and mirror database) to the client. Clients can fall back to other mirrors if the current one has an issue. With this knowledge, the client is enabled to work its way to a successful download even under adverse circumstances. All this can be done without complicated user interaction, and the download can be much more reliable and efficient. In contrast, a traditional HTTP redirect to a mirror conveys only minimal information -- one link to one server -- and there is no provision in the HTTP protocol to handle failures. Furthermore, in order to provide better load distribution across servers and potentially faster downloads to users, Metalink/HTTP facilitates multi-source downloads, where portions of a file are downloaded from multiple mirrors (and, optionally, Peer-to-Peer) simultaneously. Upon connection to a Metalink/HTTP server, a client will receive information about other sources of the same resource and a cryptographic hash of the whole resource. The client will then be able to request chunks of the file from the various sources, scheduling appropriately in order to maximize the download rate."
The benefit to WP-MIRROR would be much more reliable downloads, which would obviate the file validation process.
The benefits to folks on this e-mail list are: a) your mirror sites would get more traffic (Ariel mentioned that they are getting very little); b) the download process (for metalink-capable clients) would be robust against the outage of any one mirror; and c) metalink-capable clients are now common (cURL, kget, ...).
I understand that the idea for metalink originated with those who posted GNU/Linux distributions in .iso format. With each new .iso release, there would be a surge of downloading, causing many partial downloads (i.e. much wasted bandwidth). Metalink helped spread the load; and, by transporting hashes, improved download integrity.
Conclusion: I will table the issue of metalink, for lack of an immediate requirement. Action Item: WP-MIRROR 0.6 will incorporate your list of dump/tarball mirror sites as a configurable parameter.
3) RSYNC
Thanks for letting me know that dumps and tarballs are available using rsync. I much prefer rsync over http and ftp. I mirror the Debian archive, and recently switched from apt-mirror (which uses wget) to ftpsync (which uses rsync); I am very happy with the results.
Action Item: WP-MIRROR 0.6 will make use of rsync.
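For illustration, the rsync step might look something like the following untested sketch; the source path below is a placeholder, not a confirmed rsync module name, which should be taken from the mirror documentation:

;; Sketch only: pull the dump files for one wiki via rsync instead of HTTP.
;; The source path is a PLACEHOLDER, not a confirmed rsync module name.
(defun rsync-dumps (&optional (source "dumps.example.org::dumps/simplewiki/latest/")
                              (target "/var/lib/mediawiki/dumps/simplewiki/"))
  (uiop:run-program (list "rsync" "-av" "--partial" "--timeout=600" source target)
                    :output t
                    :error-output :output
                    :ignore-error-status t))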
Ariel raised some other points, which I shall address in a separate email.
Sincerely Yours, Kent
On Wed, Jan 2, 2013 at 4:30 PM, wp mirror wpmirrordev@gmail.com wrote:
Happy New Year, and thanks for your e-mail of 2012-12-31.
You too!
- ADMINISTRATIVE
I fixed the capitalization of "Wikimedia" in both documentation and home page. I am now subscribed to this list and have read the last two years of postings.
I think I saw it was already fixed a few days ago. Thanks.
- METALINK
[...]
I still need to read more about metalink but I think it squarely falls in the "we'll support it if someone does the work to make it happen" category. (In case that wasn't clear before, you're welcome to do that work yourself if you like.)
- SPDY
WP-MIRROR 0.5 and prior versions obtain image files from http://upload.wikimedia.org/. SPDY would reduce latency. WP-MIRROR 0.6 (not yet released) uses HTTP/1.1 persistent connections. WP-MIRROR 0.6 has built-in profiling, and the image downloading process now uses 64% less (wall clock) time. Therefore SPDY may not provide much advantage. Thanks also for informing me of the image tarballs.
Conclusion: I will not pursue SPDY, for lack of a requirement. Action Item: WP-MIRROR 0.6 will make use of image tarballs.
- RSYNC
Thanks for letting me know that dumps and tarballs are available using rsync. I much prefer rsync over http and ftp. I mirror the Debian archive, and recently switched from apt-mirror (which uses wget) to ftpsync (which uses rsync); I am very happy with the results.
Action Item: WP-MIRROR 0.6 will make use of rsync.
Are you in any of the Debian mirror rotations? (in DNS, or maybe it's an internal mirror?) If you can handle the load / storage then you could become a Wikimedia mirror as well. Actually I'm typing this message on a squeeze box right now. :)
btw, for image tarballs: I'm not sure offhand if they exist for all wikis? I think maybe some larger wikis don't have them. rsync is also available for the images individually. (1 file per file)
-Jeremy
On Wed, 02-01-2013, at 16:48 +0000, Jeremy Baron wrote:
btw, for image tarballs: I'm not sure offhand if they exist for all wikis? I think maybe some larger wikis don't have them. rsync is also available for the images individually. (1 file per file)
The only project that does not have a media tarball is commons; just use rsync for that. Every other project has two sets of tarballs: one with locally hosted media (uploaded to the project) and one with remotely hosted media (served from commons).
Ariel
Dear Jeremy,
On 1/2/13, Jeremy Baron jeremy@tuxmachine.com wrote:
Are you in any of the Debian mirror rotations? (in DNS. or maybe it's an internal mirror?) If you can handle the load / storage then you could become a Wikimedia mirror as well. Actually I'm typing this message on a squeeze box right now. :)
My Debian mirror is a leaf node, and is used internally. Previously, when I used apt-mirror, I downloaded from http://ftp.us.debian.org/debian/ which is a primary site (DNS round robin). When I switched to ftpsync, I decided to rsync with debian.gtisc.gatech.edu which is a secondary mirror.
I do not have sufficient storage or bandwidth to mirror much of the Wikimedia Foundation's collection (approaching 100T ?). I do mirror a few of the wikipedias (en, simple, xh, zu, and cho), and then only the latest pages and articles, and their image files. This is primarily for the purpose of developing WP-MIRROR and for internal use. Most of my dump/image download experiments are conducted behind a web caching proxy so as to avoid wasting the Foundation's bandwidth (and my own).
Sincerely Yours, Kent
On 02/01/13 17:30, wp mirror wrote:
- METALINK
WP-MIRROR 0.5 and prior versions had to deal with thousands of corrupt image files. Most of these were partial downloads. cURL would time out and leave corrupt files. I currently deal with that by validating the images. Validation, however, consumes a lot of time. So I am looking for ways to improve the reliability of downloading.
What do you mean by the “file validation process”?
You can check the download against the sha1 it should have (in most cases, there are a few files with broken hashes, or missing revisions...).
Dear Platonides,
Happy New Year. Thank you for your email.
Here is a sketch of the issues with image files:
1) IMAGE FILE NAME
Issue: The enwiki has many image files whose names contain shell metacharacters (ampersand, asterisk, backquote, percent, question mark, etc.). The problem may be illustrated by trying the following slightly hazardous examples:
(shell)$ curl -O http://star*.jpg
(shell)$ curl -O http://foo%60ls%60bar.jpg
Countermeasures: a) When WP-MIRROR 0.5 scrapes the dump for image file names, such file names are immediately dropped from further consideration; and b) after downloading is complete, I sequester any image file whose name contains a percent sign, because such file names cause `rebuildImages.php --missing' to fail.
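In outline, the filter amounts to something like this (sketch only; the set of rejected characters below is illustrative, not the exact list that WP-MIRROR uses):

;; Sketch only: reject image file names containing shell metacharacters.
;; The rejected-character set is illustrative, not WP-MIRROR's exact list.
(defparameter *unsafe-characters* "&*`%?$|;<>\"'\\")

(defun safe-image-name-p (name)
  "Return NIL when NAME contains a character we refuse to handle."
  (not (find-if (lambda (c) (find c *unsafe-characters*)) name)))

;; usage: (remove-if-not #'safe-image-name-p scraped-file-names)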
2) PARTIAL DOWNLOADS
Issue: HTTP is not very reliable. I see thousands of partially downloaded image files littering $wgScriptPath/images/[0-9a-f]/.
The ideal solution would be to use digests. Digests are fast. However, while dump file digests (md5sum) are posted, image file checksums are not. Nor have I seen any digest metadata transported in HTTP headers as would be the case with metalink. So, I run the following:
Countermeasures: WP-MIRROR 0.5 and prior validate all image files by executing:
(shell)$ gm identify -verbose <filename> 2>&1 | grep identify
For most files, this test produces no output. When there is output, it is always an error message. There is a collection of such messages in the WP-MIRROR 0.5 Reference Manual http://www.nongnu.org/wp-mirror/manual/wp-mirror-0.5.pdf. See in particular:
Appendix E.14 Experiments with Corrupt Images
Appendix E.19 Messages
Bad image files are sequestered to $wgScriptPath/images/bad-images/[0-9a-f]/ for later manual inspection. Running `gm identify -verbose' is the time-consuming step that I mentioned in a previous e-mail, and it is the one I would like to obviate if possible.
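In outline, the check looks something like this (sketch only; it simply mirrors the `gm identify -verbose ... | grep identify' test above, here shelling out via uiop:run-program):

;; Sketch only: a file counts as valid when `gm identify -verbose'
;; emits no output containing "identify".
(defun image-valid-p (path)
  (let ((output (uiop:run-program
                 (list "gm" "identify" "-verbose" (namestring path))
                 :output :string
                 :error-output :output
                 :ignore-error-status t)))
    (not (search "identify" output))))

;; invalid files could then be moved aside, e.g.:
;; (unless (image-valid-p file)
;;   (rename-file file (merge-pathnames "bad-images/" file)))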
3) INVALID FILES
Issue: Many downloaded image files, upon closer inspection, turn out to be error messages produced by a nearby web caching proxy. A copy of one such message is given in:
Appendix E.19.2 Filename Issues
Countermeasures: After downloading, all image files smaller than 1K are grepped for "302 Redirected". Bad image files are sequestered to $wgScriptPath/images/bad-images/[0-9a-f]/ for later manual inspection.
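The small-file check is simple enough to sketch (untested, pure Common Lisp):

;; Sketch only: flag tiny "image" files that are really cached proxy
;; error pages containing the text "302 Redirected".
(defun proxy-error-page-p (path &optional (size-limit 1024))
  (with-open-file (in path :element-type '(unsigned-byte 8)
                           :if-does-not-exist nil)
    (when (and in (< (file-length in) size-limit))
      (let ((bytes (make-array (file-length in)
                               :element-type '(unsigned-byte 8))))
        (read-sequence bytes in)
        (not (null (search (map 'vector #'char-code "302 Redirected")
                           bytes)))))))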
4) SHA1SUM
I read with interest your remarks below about SHA1. I am aware of the img_sha1 field in the image table. However, I am unable to reproduce the values that I find there. As a concrete example, consider the file 'Arc_en_ciel.png' which appears in simplewiki:
(shell)$ env printf %s Arc_en_ciel.png | openssl dgst -md5
(stdin)= 00135a44372c142bd509367a9f166733
So far, so good. The file is indeed stored under $wgScriptPath/images/0/00/.
(rootshell)# openssl dgst -sha1 0/00/Arc_en_ciel.png
SHA1(0/00/Arc_en_ciel.png)= fd67104be2338dea99e1211be8b6824d3b271c38
(shell)$ mysql --host=localhost --user=root --password
Password: ....
mysql> SELECT img_sha1,img_name FROM simplewiki.image WHERE img_name='Arc_en_ciel.png';
+---------------------------------+-----------------+
| img_sha1                        | img_name        |
+---------------------------------+-----------------+
| tllx8mwbr31uissi6a9jq86836d6vy0 | Arc_en_ciel.png |
+---------------------------------+-----------------+
1 row in set (0.00 sec)
Not hexadecimal. The data type is VARBINARY(32), so I try conversion:
mysql> select HEX(img_sha1),img_name from simplewiki.image where img_name='Arc_en_ciel.png';
+----------------------------------------------------------------+-----------------+
| HEX(img_sha1)                                                    | img_name        |
+----------------------------------------------------------------+-----------------+
| 746C6C78386D776272333175697373693661396A7138363833366436767930 | Arc_en_ciel.png |
+----------------------------------------------------------------+-----------------+
1 row in set (0.14 sec)
Still not a match.
Perhaps you could help me understand how these digests are computed.
Sincerely Yours, Kent
On 1/2/13, Platonides platonides@gmail.com wrote:
On 02/01/13 17:30, wp mirror wrote:
- METALINK
WP-MIRROR 0.5 and prior versions had to deal with thousands of corrupt image files. Most of these were partial downloads. cURL would time out and leave corrupt files. I currently deal with that by validating the images. Validation, however, consumes a lot of time. So I am looking for ways to improve the reliability of downloading.
What do you mean by the “file validation process”?
You can check the download against the sha1 it should have (in most cases, there are a few files with broken hashes, or missing revisions...).
On 03/01/13 00:02, wp mirror wrote:
Dear Platonides,
Happy New Year. Thank you for your email.
Here is a sketch of the issues with image files:
- IMAGE FILE NAME
Issue: The enwiki has many image files whose names contain shell metacharacters (ampersand, asterisk, backquote, percent, question mark, etc.). The problem may be illustrated by trying the following slightly hazardous examples:
(shell)$ curl -O http://star*.jpg
(shell)$ curl -O http://foo%60ls%60bar.jpg
Obviously, you should have been using:
$ curl -O 'http://star*.jpg'
$ curl -O 'http://foo%60ls%60bar.jpg'
If you simply pass the parameters without quoting to curl, well, that's a bad idea. Especially since you don't seem to be treating $ specially...
Countermeasures: a) When WP-MIRROR 0.5 scrapes the dump for image file names, such file names are immediately dropped from further consideration; and b) after downloading is complete, I sequester any image file whose name contains a percent sign, because such file names cause `rebuildImages.php --missing' to fail.
- PARTIAL DOWNLOADS
Issue: HTTP is not very reliable. I see thousands of partially downloaded image files littering $wgScriptPath/images/[0-9a-f]/.
The ideal solution would be to use digests. Digests are fast. However, while dump file digests (md5sum) are posted, image file checksums are not. Nor have I seen any digest metadata transported in HTTP headers as would be the case with metalink. So, I run the following:
(...) Digests are available in the db (see below), except a few errors as mentioned.
- INVALID FILES
Issue: Many downloaded image files, upon closer inspection, turn out to be error messages produced by a nearby web caching proxy. A copy of one such message is given in:
Appendix E.19.2 Filename Issues
Countermeasures: After downloading, all image files smaller than 1K are grepped for "302 Redirected". Bad image files are sequestered to $wgScriptPath/images/bad-images/[0-9a-f]/ for later manual inspection.
- SHA1SUM
I read with interest your remarks below about SHA1. I am aware of the img_sha1 field in the image table. However, I am unable to reproduce the values that I find there. As a concrete example, consider the file 'Arc_en_ciel.png' which appears in simplewiki:
(shell)$ env printf %s Arc_en_ciel.png | openssl dgst -md5
(stdin)= 00135a44372c142bd509367a9f166733
So far, so good. The file is indeed stored under $wgScriptPath/images/0/00/.
(rootshell)# openssl dgst -sha1 0/00/Arc_en_ciel.png
SHA1(0/00/Arc_en_ciel.png)= fd67104be2338dea99e1211be8b6824d3b271c38
(shell)$ mysql --host=localhost --user=root --password
Password: ....
mysql> SELECT img_sha1,img_name FROM simplewiki.image WHERE img_name='Arc_en_ciel.png';
+---------------------------------+-----------------+
| img_sha1                        | img_name        |
+---------------------------------+-----------------+
| tllx8mwbr31uissi6a9jq86836d6vy0 | Arc_en_ciel.png |
+---------------------------------+-----------------+
1 row in set (0.00 sec)
Not hexadecimal. The data type is VARBINARY(32), so I try conversion:
mysql> select HEX(img_sha1),img_name from simplewiki.image where img_name='Arc_en_ciel.png';
+----------------------------------------------------------------+-----------------+
| HEX(img_sha1)                                                    | img_name        |
+----------------------------------------------------------------+-----------------+
| 746C6C78386D776272333175697373693661396A7138363833366436767930 | Arc_en_ciel.png |
+----------------------------------------------------------------+-----------------+
1 row in set (0.14 sec)
Still not a match.
Perhaps you could help me understand how these digests are computed.
Those are sha1 in base-36. You will need to convert from base-36 to base-16 to get the “classical output”.
Dear Platonides,
On 1/2/13, Platonides platonides@gmail.com wrote:
- IMAGE FILE NAME
-----snip-----
Obviously, you should have been using:
$ curl -O 'http://star*.jpg'
$ curl -O 'http://foo%60ls%60bar.jpg'
If you simply pass the parameters without quoting to curl, well, that's a bad idea. Especially since you don't seem to be treating $ specially...
Of course. I learned the quoting rules for /bin/sh, SQL, and many other systems. My point is really about risk tolerance. The image file `star*.jpg' is one real example of what was downloaded using an early version of WP-MIRROR, which I then rewrote to block. I am averse to file names that contain wildcards and other shell metacharacters. I can handle them safely *almost* all the time. But,
(shell)$ rm 'star*.jpg'   <-- one day I will forget to do this,
(shell)$ rm star*.jpg     <-- and will instead do this (with collateral damage).
Murphy's Law: Work two days straight, inadvertently delete three days' work, discover the backup tape is unreadable.
-----snip-----
- SHA1SUM
-----snip-----
(rootshell)# openssl dgst -sha1 0/00/Arc_en_ciel.png
SHA1(0/00/Arc_en_ciel.png)= fd67104be2338dea99e1211be8b6824d3b271c38
-----snip-----
mysql> SELECT img_sha1,img_name FROM simplewiki.image WHERE img_name='Arc_en_ciel.png';
+---------------------------------+-----------------+
| img_sha1                        | img_name        |
+---------------------------------+-----------------+
| tllx8mwbr31uissi6a9jq86836d6vy0 | Arc_en_ciel.png |
+---------------------------------+-----------------+
1 row in set (0.00 sec)
-----snip-----
Those are sha1 in base-36. You will need to convert from base-36 to base-16 to get the “classical output”.
I can't test this with the MySQL function CONV(), which is limited to 64 bits, so let's try:
(shell)$ clisp -q -q
[1]> (string-downcase (format nil "~36r" #xfd67104be2338dea99e1211be8b6824d3b271c38))
"tllx8mwbr31uissi6a9jq86836d6vy0"
It's a match. Excellent! Thank you very much.
Action Item: WP-MIRROR 0.6 shall use SHA1 digests to validate image files.
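Concretely, the check will be along these lines (untested sketch; it assumes the Ironclad library for SHA-1, which may not be what WP-MIRROR 0.6 ends up using):

;; Sketch only: compare a downloaded file against the image table's
;; img_sha1 column. Ironclad is an assumption here; MediaWiki stores
;; the digest in base-36, zero-padded to 31 characters.
(defun file-sha1-base36 (path)
  (let ((hex (ironclad:byte-array-to-hex-string
              (ironclad:digest-file :sha1 path))))
    (string-downcase
     (format nil "~36,31,'0r" (parse-integer hex :radix 16)))))

(defun image-matches-db-p (path img-sha1)
  (string= (file-sha1-base36 path) img-sha1))

;; e.g. (image-matches-db-p "0/00/Arc_en_ciel.png"
;;                          "tllx8mwbr31uissi6a9jq86836d6vy0")  => T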
Sincerely Yours, Kent