Brion has recently added code to store articles in 'old' SQL table in compressed format, so I will need to adjust the scripts for the international stats.
I spent several hours on it, and despite some useful tips from Brion I can't get the article data inflated; all I get is a Z_DATA_ERROR (-3).
Brion sent me a small sample of the articles in the fr: 'old' dump in compressed raw format, without escape sequences and other fields, just article data. Even this I could not tackle.
Brion wrote:
> Here's a zip file containing the raw bytes of compressed old_text from
> the first up to 100 rows in the table:
> http://leuksman.com/misc/raw.zip
> They do decompress with gzinflate() in PHP.
Here is my test script:

#!/usr/bin/perl
use CGI::Carp qw(fatalsToBrowser);
use Compress::Zlib;

$path = "raw/" ;
($refinf, $status) = inflateInit();

for ($i = 1 ; $i <= 100 ; $i++)
{ &ReadFile ($i) ; }
exit ;

sub ReadFile
{
  $file_in = $path . "old-" . $i . ".raw" ;
  open "FILE_IN", "<", $file_in || die ("Input file " . $file_in . " could not be opened.") ;
  binmode FILE_IN ;
  $article = "" ;
  while ($line = <FILE_IN>)
  { chomp ($line) ; $article .= $line ; }

  ($article2, $status) = $refinf->inflate ($article) ;
  if ($status == Z_OK) # Z_OK = 0
  { print "$i:OK: " . substr ($article2,0,50) . "\n" ; }
  else
  { print "$i:Unzip error: $status\n" ; } # Z_DATA_ERROR = -3
}
Can someone help me out with this? I can deflate/inflate dummy texts, so the libraries are all in place. (I use ActivePerl 5.8 on Windows.)

-------------------------------------------------------------

There is a second problem, possibly trivial once the problem above has been solved (actually, I hope the problem above is a trivial oversight of mine too):
The SQL dump contains escape sequences. A small section of the fr: old dump, new style, that Brion sent me contains:

\Z: 3541 times
\: 3497
": 3428
\n: 3598
\r: 3550
\0: 3190
\Z is not listed on http://www.mysql.com/doc/en/String_syntax.html and I could not find any other doc referring to it. \z is listed, so maybe upper/lower case makes no difference, but I doubt it.
Anyone encountered this before?
Thanks for any help.
Erik Zachte
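
On the escape sequences: in the MySQL string-literal syntax as currently documented, \Z stands for ASCII 26 (Control-Z) and \0 for a NUL byte. Below is a minimal Perl sketch of undoing such escapes before the bytes are handed to inflate. The helper name mysql_unescape is hypothetical, and passing any other escaped character through without its backslash is an assumption that suits dump output, not a full implementation of the literal syntax.

```perl
use strict;
use warnings;

# Escape letters and the bytes they stand for (MySQL string-literal syntax).
my %UNESCAPE = (
    '0'  => "\x00",   # NUL byte
    'b'  => "\x08",   # backspace
    't'  => "\x09",   # tab
    'n'  => "\x0A",   # newline
    'r'  => "\x0D",   # carriage return
    'Z'  => "\x1A",   # Control-Z
    '\\' => "\\",
    "'"  => "'",
    '"'  => '"',
);

# Hypothetical helper: turn an escaped field from the dump back into raw bytes.
sub mysql_unescape {
    my ($s) = @_;
    # One left-to-right pass, so an escaped backslash followed by "n"
    # stays backslash + "n" and is not read as a newline.
    # Assumption: unknown escapes collapse to the character itself.
    $s =~ s/\\(.)/exists $UNESCAPE{$1} ? $UNESCAPE{$1} : $1/gse;
    return $s;
}
```

The single substitution pass matters for compressed data: replacing each escape in a separate pass could reinterpret bytes that an earlier pass had already restored.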
On Jan 16, 2004, at 09:22, Erik Zachte wrote:
> Brion has recently added code to store articles in 'old' SQL table in compressed format, so I will need to adjust the scripts for the international stats.
> I spent several hours on it, and despite some useful tips from Brion I can't get those article data inflated, all I get is a Z_DATA_ERROR (-3)
Ok, a couple of problems: Using chomp on the lines corrupts the input data, because it strips bytes from the binary stream that happen to look like line endings. Taking the chomp out fixes that.
The same inflation stream gets reused for every file. That doesn't look right to me.
inflate() returns Z_STREAM_END when it gets to the end, not Z_OK. If you get Z_OK, you have to ask it to run more data, or something...? Not entirely sure how to handle that.
And the really hard one to figure out: PHP's gzdeflate() and gzinflate() set the window bits to -MAX_WBITS, which suppresses the zlib header and checksum. Unless you pass the same parameter to inflateInit(), the Perl-side functions will expect those extra bytes and fail without explanation.
Attached is a version that mostly works, except it doesn't handle the Z_OK case right.
-- brion vibber (brion @ pobox.com)
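
The points in the message above can be put together roughly as follows. This is a minimal sketch and not the attached script: the helper names inflate_raw and read_raw_file are hypothetical, the raw/old-N.raw layout is taken from the test script earlier in the thread, and each file is assumed to hold one complete raw deflate stream as written by PHP's gzdeflate().

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Compress::Zlib;

# Hypothetical helper: inflate one raw deflate stream
# (no zlib header, no checksum).
sub inflate_raw {
    my ($data) = @_;

    # A fresh stream for every input; negative WindowBits selects raw deflate.
    my ($inf, $status) = inflateInit(-WindowBits => -MAX_WBITS);
    die "inflateInit failed: $status\n" unless $status == Z_OK;

    # inflate() consumes its argument; $data is a private copy here.
    my ($text, $rc) = $inf->inflate($data);
    return ($text, $rc);
}

# Hypothetical helper: slurp a file in binary mode, without chomp.
sub read_raw_file {
    my ($name) = @_;
    open(my $fh, '<', $name) or die "Cannot open $name: $!\n";
    binmode $fh;
    local $/;                 # undefined separator: read the whole file
    my $data = <$fh>;
    close $fh;
    return $data;
}

for my $i (1 .. 100) {
    my $file = "raw/old-$i.raw";
    next unless -e $file;     # the set holds "up to" 100 files
    my ($text, $rc) = inflate_raw(read_raw_file($file));
    if ($rc == Z_STREAM_END) {
        print "$i:OK: ", substr($text, 0, 50), "\n";
    } else {
        print "$i:Unzip error: $rc\n";
    }
}
```

Success is tested against Z_STREAM_END here, so a file that stops at Z_OK is still reported as an error by this sketch.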
> inflate() returns Z_STREAM_END when it gets to the end, not Z_OK. If you get Z_OK, you have to ask it to run more data, or something...? Not entirely sure how to handle that.
See attachment.
On Jan 17, 2004, at 03:06, Peter Gervai wrote:
>> inflate() returns Z_STREAM_END when it gets to the end, not Z_OK. If you get Z_OK, you have to ask it to run more data, or something...? Not entirely sure how to handle that.
> See attachment. <c2.pl>
That's definitely cleaner code, thanks!
However it fails on old-59.raw and old-60.raw; both of these are short files but seem to end with Z_OK rather than Z_STREAM_END...
PHP's gzinflate() works without complaint on both, and the results look correct.
Here's the uncompressed text:
http://fr.wikipedia.org/w/wiki.phtml?title=Alsace&action=edit&oldid=...
http://fr.wikipedia.org/w/wiki.phtml?title=Alsace&action=edit&oldid=...
The compressed test set is here: http://leuksman.com/misc/raw.zip
-- brion vibber (brion @ pobox.com)
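
One way to read the two result codes, going by the Compress::Zlib documentation: Z_OK means the input so far was valid but the end of the stream has not been seen, while Z_STREAM_END means the stream is complete. The sketch below feeds the data in chunks and reports whether the end was reached, so that a file which inflates but stops at Z_OK can be flagged separately from a real error. The helper name inflate_chunks and the chunk size are hypothetical, and the sketch does not explain why old-59.raw and old-60.raw end with Z_OK.

```perl
use strict;
use warnings;
use Compress::Zlib;

# Hypothetical helper: feed a raw deflate stream to inflate() in chunks.
# Returns (text, complete, last_status); complete is 1 only if
# Z_STREAM_END was seen.
sub inflate_chunks {
    my ($data, $chunk_size) = @_;
    $chunk_size ||= 4096;     # arbitrary default

    my ($inf, $status) = inflateInit(-WindowBits => -MAX_WBITS);
    die "inflateInit failed: $status\n" unless $status == Z_OK;

    my $text = '';
    my $rc   = Z_OK;
    my $pos  = 0;
    while ($pos < length $data) {
        my $chunk = substr($data, $pos, $chunk_size);
        $pos += length $chunk;

        my ($out, $st) = $inf->inflate($chunk);
        $rc = $st;
        # Anything other than Z_OK or Z_STREAM_END is a real error.
        last unless $rc == Z_OK || $rc == Z_STREAM_END;
        $text .= $out;
        last if $rc == Z_STREAM_END;
    }
    return ($text, ($rc == Z_STREAM_END ? 1 : 0), $rc);
}
```

A caller could then accept the text when complete is 1, and log the file for inspection when the last status is Z_OK with all input consumed.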
wikitech-l@lists.wikimedia.org