Two minor changes in the process for the data dumps I started earlier today:
* The intermediate "stub" XML dumps are now available for download instead of vanishing into a temporary directory. These contain all the article and revision metadata but not the revision text.
* The .7z version of the full-history dump is now built after the .bz2 completes instead of both at the same time; this should make the .bz2 versions of the big wikis available for download sooner as it won't have to wait on the slower 7-zip compressor. (It's still using the slow single-threaded bzip2 for the moment, though.)
The stub dumps use the same format as the full XML dumps, except that each <text> element is empty. It carries an id attribute (not listed in the XML Schema file, so don't enforce schema validation in your parser) indicating the internal storage node that contains that revision's text.
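For reference, a single revision in a stub dump looks roughly like this (element names follow the standard export format; the id values here are made up):

```xml
<revision>
  <id>100</id>
  <timestamp>2005-01-01T00:00:00Z</timestamp>
  <!-- empty in stub dumps; id points at internal text storage -->
  <text id="12345" />
</revision>
```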
This node number isn't really useful unless you're on our servers, since the raw storage tables aren't accessible from outside. But if you want to run statistics over the rest of the metadata fields, these lighter-weight files will be a lot faster to work with than the versions with full text embedded.
These are compressed with gzip for speed; the stub dump for English Wikipedia full history runs about 2 gigabytes compressed.
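As a rough sketch of how you might stream metadata out of a stub dump without loading it into memory, here's a small Python example using the standard library's iterparse. The helper name and the sample XML are mine, and the sample is heavily simplified (real dumps also carry contributor, comment, page ids, and so on):

```python
import io
import xml.etree.ElementTree as ET

def local(tag):
    # Strip the XML namespace, e.g. "{http://.../export-0.3/}text" -> "text"
    return tag.rsplit('}', 1)[-1]

def iter_stub_revisions(fileobj):
    """Stream (page_title, revision_id, text_storage_id) tuples from a stub dump."""
    title = None
    for _event, elem in ET.iterparse(fileobj, events=('end',)):
        tag = local(elem.tag)
        if tag == 'title':
            title = elem.text
        elif tag == 'revision':
            rev_id = text_id = None
            for child in elem:          # direct children only, so we skip contributor's <id>
                ctag = local(child.tag)
                if ctag == 'id':
                    rev_id = child.text
                elif ctag == 'text':
                    text_id = child.attrib.get('id')
            yield (title, rev_id, text_id)
            elem.clear()                # keep memory flat on multi-gigabyte dumps

# A tiny made-up sample in the stub shape:
SAMPLE = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.3/">
  <page>
    <title>Example</title>
    <id>1</id>
    <revision>
      <id>100</id>
      <timestamp>2005-01-01T00:00:00Z</timestamp>
      <text id="12345" />
    </revision>
  </page>
</mediawiki>"""

revs = list(iter_stub_revisions(io.StringIO(SAMPLE)))
print(revs)  # [('Example', '100', '12345')]
```

Against a real dump you'd pass in the decompressed stream instead, e.g. `gzip.open('enwiki-stub-meta-history.xml.gz', 'rt')` (filename illustrative).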
-- brion vibber (brion @ pobox.com)
wikitech-l@lists.wikimedia.org