Dear Platonides,
Happy New Year. Thank you for your email.
Here is a sketch of the issues with image files:
1) IMAGE FILE NAME
Issue: The enwiki has many image files whose names contain shell
metacharacters (ampersand, asterisk, backquote, percent sign, question
mark, etc.).
The problem may be illustrated by trying the following slightly
hazardous examples:
(shell)$ curl -O http://star*.jpg
(shell)$ curl -O http://foo`ls`bar.jpg
Countermeasures: a) When WP-MIRROR 0.5 scrapes the dump for image
file names, any such file name is immediately dropped from further
consideration; and b) after downloading is complete, I sequester any
image file whose name contains a percent sign, because such file names
cause `rebuildImages.php --missing' to fail.
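By way of illustration, the filtering in (a) might be sketched as
follows; the exact character set is my assumption here, not
necessarily WP-MIRROR's actual list:

```python
# Sketch: drop image file names containing shell metacharacters or a
# percent sign before they ever reach curl or rebuildImages.php.
# UNSAFE is an illustrative set, not WP-MIRROR's definitive list.

UNSAFE = set("&*`%?$;|<>\"'\\")

def is_safe_name(name: str) -> bool:
    """Return True if the file name contains none of the unsafe characters."""
    return not (UNSAFE & set(name))

def filter_names(names):
    """Keep only the names that are safe to pass to a shell command."""
    return [n for n in names if is_safe_name(n)]
```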
2) PARTIAL DOWNLOADS
Issue: HTTP is not very reliable. I see thousands of partially
downloaded image files littering $wgScriptPath/images/[0-9a-f]/.
The ideal solution would be to use digests. Digests are fast.
However, while dump file digests (md5sum) are posted, image file
checksums are not. Nor have I seen any digest metadata transported in
HTTP headers, as would be the case with metalink.
Countermeasures: WP-MIRROR 0.5 and prior versions validate every
image file by executing:
(shell)$ gm identify -verbose <filename> 2>&1 | grep identify
For most files, this test produces no output. When there is output,
it is always an error message. There is a collection of such messages
in the WP-MIRROR 0.5 Reference Manual
<http://www.nongnu.org/wp-mirror/manual/wp-mirror-0.5.pdf>. See in
particular:
Appendix E.14 Experiments with Corrupt Images
Appendix E.19 Messages
Bad image files are sequestered to
$wgScriptPath/images/bad-images/[0-9a-f]/ for later manual inspection.
Running `gm identify -verbose' is the time-consuming step that I
mentioned in a previous e-mail, and it is the one I would like to
obviate if possible.
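For what it is worth, that validation step could be sketched in
Python roughly as follows; the subprocess wrapper and the sample
error text in the test are illustrative only (the real messages are
the ones catalogued in Appendix E.19):

```python
import subprocess

def error_lines(output: str) -> list[str]:
    """Filter identify output down to its error messages (the grep step).
    Matching on 'identify:' is my assumption about the message prefix."""
    return [line for line in output.splitlines() if "identify:" in line]

def identify_errors(filename: str) -> list[str]:
    """Run `gm identify -verbose` on one file and return any error lines.
    An empty list means GraphicsMagick raised no complaint."""
    proc = subprocess.run(
        ["gm", "identify", "-verbose", filename],
        capture_output=True, text=True)
    return error_lines(proc.stdout + proc.stderr)
```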
3) INVALID FILES
Issue: Many downloaded image files, upon closer inspection, turn out
to be error messages produced by a nearby web caching proxy. A copy
of one such message is given in:
Appendix E.19.2 Filename Issues
Countermeasures: After downloading, all image files smaller than 1K
are grepped for "302 Redirected". Bad image files are sequestered to
$wgScriptPath/images/bad-images/[0-9a-f]/ for later manual inspection.
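A rough sketch of that sweep (the directory layout is simplified, and
the helper name is my own):

```python
import shutil
from pathlib import Path

REDIRECT_MARKER = b"302 Redirected"   # string grepped for after download
SIZE_LIMIT = 1024                     # only files under 1K are inspected

def sequester_proxy_errors(images_dir: Path, bad_dir: Path) -> list[Path]:
    """Move any small 'image' that is really a proxy error page into bad_dir.
    A simplification of the $wgScriptPath/images/[0-9a-f]/ layout."""
    moved = []
    for path in images_dir.rglob("*"):
        if path.is_file() and path.stat().st_size < SIZE_LIMIT:
            if REDIRECT_MARKER in path.read_bytes():
                bad_dir.mkdir(parents=True, exist_ok=True)
                target = bad_dir / path.name
                shutil.move(str(path), str(target))
                moved.append(target)
    return moved
```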
4) SHA1SUM
I read with interest your remarks below about SHA1. I am aware of the
img_sha1 field in the images table. However, I am unable to reproduce
the values that I find there. As a concrete example, consider the
file 'Arc_en_ciel.png' which appears in simplewiki:
(shell)$ env printf %s Arc_en_ciel.png | openssl dgst -md5
(stdin)= 00135a44372c142bd509367a9f166733
So far, so good. The file is indeed stored under $wgScriptPath/images/0/00/.
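(As an aside, the 0/00/ directory follows from that MD5 of the file
name; a small sketch reproducing it:)

```python
import hashlib

def storage_path(name: str) -> str:
    """MediaWiki-style image path <d1>/<d1d2>/<name>, where d1 and d1d2
    are taken from the hex MD5 of the file name."""
    digest = hashlib.md5(name.encode("utf-8")).hexdigest()
    return f"{digest[0]}/{digest[:2]}/{name}"

print(storage_path("Arc_en_ciel.png"))
# -> 0/00/Arc_en_ciel.png
```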
(rootshell)# openssl dgst -sha1 0/00/Arc_en_ciel.png
SHA1(0/00/Arc_en_ciel.png)= fd67104be2338dea99e1211be8b6824d3b271c38
(shell)$ mysql --host=localhost --user=root --password
Password:
....
mysql> SELECT img_sha1,img_name FROM simplewiki.image WHERE
img_name='Arc_en_ciel.png';
+---------------------------------+-----------------+
| img_sha1                        | img_name        |
+---------------------------------+-----------------+
| tllx8mwbr31uissi6a9jq86836d6vy0 | Arc_en_ciel.png |
+---------------------------------+-----------------+
1 row in set (0.00 sec)
Not hexadecimal. The data type is VARBINARY(32), so I try conversion:
mysql> SELECT HEX(img_sha1),img_name FROM simplewiki.image WHERE
img_name='Arc_en_ciel.png';
+----------------------------------------------------------------+-----------------+
| HEX(img_sha1)                                                  | img_name        |
+----------------------------------------------------------------+-----------------+
| 746C6C78386D776272333175697373693661396A7138363833366436767930 | Arc_en_ciel.png |
+----------------------------------------------------------------+-----------------+
1 row in set (0.14 sec)
Still not a match.
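One hypothesis worth testing, which I have not yet confirmed against
the MediaWiki source: perhaps img_sha1 is the SHA-1 digest re-encoded
in base-36 (digits 0-9a-z), left-padded with zeros to 31 characters.
A quick sketch:

```python
# Hypothesis: img_sha1 = SHA-1 digest converted from base-16 to base-36,
# lowercase, zero-padded on the left to 31 characters.

DIGITS = "0123456789abcdefghijklmnopqrstuvwxyz"

def hex_to_base36(hex_digest: str, width: int = 31) -> str:
    n = int(hex_digest, 16)
    out = ""
    while n:
        n, r = divmod(n, 36)
        out = DIGITS[r] + out
    return out.rjust(width, "0")

print(hex_to_base36("fd67104be2338dea99e1211be8b6824d3b271c38"))
# -> tllx8mwbr31uissi6a9jq86836d6vy0
```

If I have this right, the conversion reproduces the table value above.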
Perhaps you could help me understand how these digests are computed.
Sincerely Yours,
Kent
On 1/2/13, Platonides <platonides(a)gmail.com> wrote:
On 02/01/13 17:30, wp mirror wrote:
2) METALINK
WP-MIRROR 0.5 and prior versions had to deal with thousands of corrupt
image files. Most of these were partial downloads. cURL would
time-out and leave corrupt files. I currently deal with that by
validating the images. Validation, however, consumes a lot of time.
So I am looking for ways to improve the reliability of downloading.
What do you mean by the “file validation process” ?
You can check the download against the sha1 it should have (in most
cases, there are a few files with broken hashes, or missing revisions...).