Hi,
I couldn't find a forum to ask for help other than this one (I am aware that it might be somewhat off-topic, sorry, but I am desperate after 2 days of unsuccessful attempts).
I downloaded the English version of the whole Wikipedia history (some 10GB) and loaded it into a MySQL database. The data is present. I can access it with e.g. "mysql -> SELECT old_title FROM old;" or with a Perl script that opens the database via DBI, executes any SQL statement and prints the results on my screen. No problem on this side. What I can't access is the actual (uncompressed, human-readable) content of the old_text column itself. It's compressed, ok. I got that from the PHP source of MediaWiki. But what's wrong with the following code?
    my $sth = $dbh->prepare("SELECT old_title, old_text FROM old");
    $sth->execute;
    while (my @row = $sth->fetchrow_array) {
        if (defined $row[0]) {
            print $row[0] . " ";
        }
        if (defined $row[1]) {
            my $a = uncompress($row[1]); # dummy variable
            print $a . "\n";
        }
    }
$a is not defined after this call. If I try another approach from Compress::Zlib, like:
    sub decompressText {
        my $d;
        my $status;
        my $out;
        my $out2;
        ($d, $status) = deflateInit(); #-Level => Z_BEST_COMPRESSION);
        $status == Z_OK or die "INIT failed\n";
        ($out, $status) = $d->deflate($_[0]);
        print "STATUS " . $status . "\n";
        $status == Z_OK or die "DEFLATE failed\n";
        ($out2, $status) = $d->flush();
        $status == Z_OK or die "FLUSH failed\n";
        if (defined $out) { print "DEFINED.\n"; }
        if (defined $out2) { print "DEFINED2.\n"; }
        my $z = $out . $out2;
        return $z;
    }
decompressText(old_text) just gives me the same binary stuff I got from the plain SELECT statement, without any (de)compression at all. According to the perldoc, both failures indicate that the decompression wasn't successful.
So my question is: how do I uncompress "old_text" in the table "old"????
Please, please help. I tried it for 2 days and I am pretty sure the error is obvious.
THANK YOU!
KaHa242
Kay Hamacher wrote:
What I can't access is the actual (uncompressed, human readable) content of the old_text-column itself. It's compressed, ok. I got that from the php-source of mediawiki. But what's wrong with the following code?
[snip]
    ($d, $status) = deflateInit(); #-Level => Z_BEST_COMPRESSION);
    $status == Z_OK or die "INIT failed\n";
    ($out, $status) = $d->deflate($_[0]);
[snip]
Two things: first, deflate() does the compression, so you want to use inflate() to decompress. (The 'de' is confusing; I slip up on that all the time too! Poor naming of the functions...)
Second, you have to match the settings that PHP's gzdeflate() function used to compress the text, namely setting the window bits to -MAX_WBITS. A negative window size tells zlib to expect a raw deflate stream with no zlib header or checksum bytes, I think; without that setting the decompressor looks for a header that isn't there and gives up.
See this thread for some sample code: http://mail.wikipedia.org/pipermail/wikitech-l/2004-January/007989.html
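
For illustration, the two points above can be put together in a minimal sketch. It is not the code from the linked thread; the helper names inflate_raw and deflate_raw are made up for this example, and deflate_raw only stands in for PHP's gzdeflate() (assuming, as described above, that it writes a raw deflate stream) so the sketch runs without a database.

```perl
use strict;
use warnings;
use Compress::Zlib;   # provides inflateInit, deflateInit, MAX_WBITS, Z_OK, Z_STREAM_END

# Hypothetical helper: inflate a raw deflate stream (no zlib header, no checksum).
sub inflate_raw {
    my ($data) = @_;   # local copy: inflate() consumes the buffer it is given
    my ($inf, $status) = inflateInit(-WindowBits => -MAX_WBITS());
    $status == Z_OK or die "inflateInit failed: $status\n";
    my $out;
    ($out, $status) = $inf->inflate($data);
    ($status == Z_OK or $status == Z_STREAM_END)
        or die "inflate failed: $status\n";
    return $out;
}

# Hypothetical stand-in for gzdeflate(): compress to a raw deflate stream.
sub deflate_raw {
    my ($text) = @_;
    my ($def, $status) = deflateInit(-WindowBits => -MAX_WBITS());
    $status == Z_OK or die "deflateInit failed: $status\n";
    my ($out, $tail);
    ($out, $status) = $def->deflate($text);
    $status == Z_OK or die "deflate failed: $status\n";
    ($tail, $status) = $def->flush();
    $status == Z_OK or die "flush failed: $status\n";
    return $out . $tail;
}

# Round trip on a sample string instead of a real old_text value.
my $sample = "== Example ==\nSome wiki text.\n";
print inflate_raw(deflate_raw($sample));
```

In the original loop, the call to uncompress($row[1]) would then become inflate_raw($row[1]), provided the rows really were stored with gzdeflate().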
-- brion vibber (brion @ pobox.com)
wikitech-l@lists.wikimedia.org