On 03/01/13 00:02, wp mirror wrote:
Dear Platonides,
Happy New Year. Thank you for your email.
Here is a sketch of the issues with image files:
- IMAGE FILE NAME
Issue: The enwiki has many image files whose names contain shell metacharacters (ampersand, asterisk, backquote, percent, question mark, etc.). The problem may be illustrated by trying the following slightly hazardous examples:
(shell)$ curl -O http://star*.jpg
(shell)$ curl -O http://foo%60ls%60bar.jpg
Obviously, you should have been using:
$ curl -O 'http://star*.jpg'
$ curl -O 'http://foo%60ls%60bar.jpg'
If you simply pass the parameters to curl without quoting, well, that's a bad idea. Especially since you don't seem to be treating $ specially...
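A minimal sketch of the quoting advice above, written in Python as an assumption (nothing in this thread says which language the mirror tool uses). The helper names `curl_argv` and `curl_command_line` are hypothetical. The first builds an argument list that can be run without any shell; the second quotes each argument with `shlex.quote` for the cases where a shell command string cannot be avoided.

```python
import shlex


def curl_argv(url):
    """Return an argument vector for curl.

    Passed as a list to subprocess.run(), no shell ever parses the URL,
    so characters such as *, ` and $ are never expanded.
    """
    return ["curl", "-O", url]


def curl_command_line(url):
    """Return one shell-safe command string with every argument quoted."""
    return " ".join(shlex.quote(arg) for arg in curl_argv(url))


if __name__ == "__main__":
    print(curl_command_line("http://star*.jpg"))
    # prints: curl -O 'http://star*.jpg'
```

With the list form, `subprocess.run(curl_argv(url), check=True)` would download the file without involving a shell at all, which sidesteps the quoting question entirely.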
Countermeasures: a) When WP-MIRROR 0.5 scrapes the dump for image file names, such file names are immediately dropped from further consideration; and b) After downloading is complete, I sequester any image files whose names contain a percent sign, because such file names cause `rebuildImages.php --missing' to fail.
- PARTIAL DOWNLOADS
Issue: HTTP is not very reliable. I see thousands of partially downloaded image files littering $wgScriptPath/images/[0-9a-f]/.
The ideal solution would be to use digests. Digests are fast. However, while dump file digests (md5sum) are posted, image file checksums are not. Nor have I seen any digest metadata transported in HTTP headers as would be the case with metalink. So, I run the following:
(...) Digests are available in the db (see below), except a few errors as mentioned.
- INVALID FILES
Issue: Many downloaded image files, upon closer inspection, turn out to be error messages produced by a nearby web caching proxy. A copy of one such message is given in:
Appendix E.19.2 Filename Issues
Countermeasures: After downloading, all image files smaller than 1K are grepped for "302 Redirected". Bad image files are sequestered to $wgScriptPath/images/bad-images/[0-9a-f]/ for later manual inspection.
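The check described above can be sketched roughly as follows. This is an illustration in Python, not the code WP-MIRROR actually runs; the function names are hypothetical, and "1K" is read here as 1024 bytes, which is an assumption.

```python
import os

MARKER = b"302 Redirected"  # the string quoted in the mail
SIZE_LIMIT = 1024           # "smaller than 1K", assumed to mean 1024 bytes


def looks_like_proxy_error(path):
    """True if the file is under the size limit and contains the marker."""
    if os.path.getsize(path) >= SIZE_LIMIT:
        return False
    with open(path, "rb") as fh:
        return MARKER in fh.read()


def find_bad_images(image_root):
    """Walk image_root and return the sorted paths of suspect files."""
    bad = []
    for dirpath, _dirnames, filenames in os.walk(image_root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if looks_like_proxy_error(path):
                bad.append(path)
    return sorted(bad)
```

The returned paths could then be moved to the bad-images directory for manual inspection; the move itself is left out so that the sketch has no side effects.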
- SHA1SUM
I read with interest your remarks below about SHA1. I am aware of the img_sha1 field in the image table. However, I am unable to reproduce the values that I find there. As a concrete example, consider the file 'Arc_en_ciel.png' which appears in simplewiki:
(shell)$ env printf %s Arc_en_ciel.png | openssl dgst -md5
(stdin)= 00135a44372c142bd509367a9f166733
So far, so good. The file is indeed stored under $wgScriptPath/images/0/00/.
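The relationship between the MD5 of the file name and the storage directory can be sketched as below. It assumes the layout suggested by the example above: the first hex digit of the digest, then the first two hex digits. The helper name `hashed_subdir` is hypothetical, and the name is assumed to be UTF-8 encoded with underscores in place of spaces.

```python
import hashlib


def hashed_subdir(file_name):
    """Return the 'x/xy' subdirectory derived from the MD5 of the file name."""
    digest = hashlib.md5(file_name.encode("utf-8")).hexdigest()
    return f"{digest[0]}/{digest[:2]}"


if __name__ == "__main__":
    # Compare the result with the directory the file is actually stored in.
    print(hashed_subdir("Arc_en_ciel.png"))
```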
(rootshell)# openssl dgst -sha1 0/00/Arc_en_ciel.png
SHA1(0/00/Arc_en_ciel.png)= fd67104be2338dea99e1211be8b6824d3b271c38
(shell)$ mysql --host=localhost --user=root --password
Password: ....
mysql> SELECT img_sha1,img_name FROM simplewiki.image WHERE img_name='Arc_en_ciel.png';
+---------------------------------+-----------------+
| img_sha1                        | img_name        |
+---------------------------------+-----------------+
| tllx8mwbr31uissi6a9jq86836d6vy0 | Arc_en_ciel.png |
+---------------------------------+-----------------+
1 row in set (0.00 sec)
Not hexadecimal. The data type is VARBINARY(32), so I try conversion:
mysql> select HEX(img_sha1),img_name from simplewiki.image where img_name='Arc_en_ciel.png';
+----------------------------------------------------------------+-----------------+
| HEX(img_sha1)                                                  | img_name        |
+----------------------------------------------------------------+-----------------+
| 746C6C78386D776272333175697373693661396A7138363833366436767930 | Arc_en_ciel.png |
+----------------------------------------------------------------+-----------------+
1 row in set (0.14 sec)
Still not a match.
Perhaps you could help me understand how these digests are computed.
Those are sha1 in base-36. You will need to convert from base-36 to base-16 to get the “classical output”.
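A minimal sketch of that conversion, in Python (an assumption; any language with big-integer support would do). The function names are hypothetical. The base-36 string is parsed as one large integer and printed again as 40 zero-padded hex digits, which is the form `openssl dgst -sha1` reports. Note that MySQL's HEX() only shows the ASCII codes of the base-36 characters, which is why it cannot match.

```python
import hashlib


def base36_to_hex(digest36):
    """Convert a base-36 SHA-1 string to 40 lowercase hex digits."""
    return format(int(digest36, 36), "040x")


def file_sha1_hex(path):
    """Compute the SHA-1 of a file, reading it in chunks."""
    h = hashlib.sha1()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()


def matches_img_sha1(path, img_sha1):
    """True if the file's SHA-1 equals the base-36 value from the db."""
    return file_sha1_hex(path) == base36_to_hex(img_sha1)
```

For the example above, the output of `base36_to_hex("tllx8mwbr31uissi6a9jq86836d6vy0")` can be compared directly with the openssl digest, and `matches_img_sha1` could serve as a check for partial downloads.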