Kevin Carillo wrote:
20050421_old_table.sql.gz ----> around 31 gigabytes and 20050421_old_table.sql ----> 34 201 362 bytes (compression factor of around 1.1)
That's entirely normal: the text stored in the old table is usually already compressed, so gzipping the dump on top of that gains very little.
With the current table layout, a row in the old table can be in one of three states:
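To see the effect for yourself, here is a rough Python sketch. The "revisions" are made-up random word salad rather than real dump data, and the helper name factor is hypothetical; it only illustrates that compressing already-compressed data buys almost nothing.

```python
import random
import zlib

# Made-up stand-in for wiki text: 200 "revisions" of 1000 random words each.
rng = random.Random(0)
words = ["wiki", "article", "revision", "history", "edit", "page", "text", "link",
         "table", "dump", "user", "talk", "image", "category", "template", "old"]
revisions = [" ".join(rng.choice(words) for _ in range(1000)).encode()
             for _ in range(200)]

# "plain" mimics uncompressed rows; "stored" mimics rows compressed one by one.
plain = b"".join(revisions)
stored = b"".join(zlib.compress(r) for r in revisions)

def factor(data: bytes) -> float:
    """Compression factor: original size divided by compressed size."""
    return len(data) / len(zlib.compress(data))

plain_factor = factor(plain)
stored_factor = factor(stored)
print(f"uncompressed rows: factor {plain_factor:.2f}")
print(f"already-compressed rows: factor {stored_factor:.2f}")
```

The second factor should come out close to 1, which is the same kind of result as the roughly 1.1 you measured on the real dump.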
(default): uncompressed single item. You probably won't find many of these in the Wikipedia dumps.
gzip: An individual text revision compressed with PHP's gzdeflate() function, to be uncompressed with PHP's gzinflate() function. These wrap zlib functions with some specific settings. If for some reason you don't want to use MediaWiki or PHP to retrieve data from the dump, see Erik Zachte's stats script for example Perl code.
object: A serialized PHP object which either contains multiple revisions of a page blobbed and compressed together, or references a particular row in which this revision can be found blobbed and compressed with others. This provides a better overall compression ratio in the database than individual compression. See includes/HistoryBlob.php for details.
gzip and object rows are indicated by the presence of those flags in the old_flags field.
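If you'd rather not use PHP or Perl, the three cases can be sketched in Python roughly like this. The function names are hypothetical, and I'm assuming old_flags is a comma-separated list of flag names. The one solid fact is that gzdeflate() writes a raw DEFLATE stream with no zlib or gzip header, which Python's zlib handles when given wbits=-15. Object rows are not decoded here; for those you would need to reimplement what includes/HistoryBlob.php does.

```python
import zlib

def php_style_gzdeflate(data: bytes) -> bytes:
    """Produce a raw DEFLATE stream, the format PHP's gzdeflate() emits."""
    compressor = zlib.compressobj(wbits=-15)  # negative wbits: no header or checksum
    return compressor.compress(data) + compressor.flush()

def decode_old_text(old_text: bytes, old_flags: str) -> bytes:
    """Return the revision text for one row of the old table (sketch only)."""
    # Assumption: old_flags holds comma-separated flag names, e.g. "gzip".
    flags = {flag.strip() for flag in old_flags.split(",") if flag.strip()}

    if "gzip" in flags:
        # Equivalent of PHP's gzinflate(): inflate a raw DEFLATE stream.
        old_text = zlib.decompress(old_text, -15)

    if "object" in flags:
        # Serialized PHP object; unpacking it is out of scope for this sketch.
        raise NotImplementedError("object row: see includes/HistoryBlob.php")

    # No relevant flag left: the text is a plain, uncompressed single item.
    return old_text

if __name__ == "__main__":
    sample = b"Some revision text. " * 20
    blob = php_style_gzdeflate(sample)
    print(len(sample), "bytes raw,", len(blob), "bytes deflated")
    print(decode_old_text(blob, "gzip") == sample)
```

Note that a plain zlib.decompress(old_text) without the negative wbits will fail on these rows, because it expects a zlib header that gzdeflate() never writes.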
-- brion vibber (brion @ pobox.com)