Two minor changes in the process for the data dumps I started earlier today:
* The intermediate "stub" XML dumps are now available for download instead of vanishing into a temporary directory. These contain all the article and revision metadata but not the revision text.
* The .7z version of the full-history dump is now built after the .bz2 completes instead of both at the same time; this should make the .bz2 versions of the big wikis available for download sooner as it won't have to wait on the slower 7-zip compressor. (It's still using the slow single-threaded bzip2 for the moment, though.)
The stub dumps use the same format as the full XML dumps, except that each <text> element is empty. It carries an id attribute (not listed in the XML Schema file, so don't enforce schema validation in your parser) indicating the internal storage node that contains that revision's text.
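For reference, a single revision in a stub dump looks roughly like this (element names follow the standard export format; the id values here are made up):

```xml
<revision>
  <id>100</id>
  <timestamp>2005-01-01T00:00:00Z</timestamp>
  <!-- empty in stub dumps; id points at internal text storage -->
  <text id="12345" />
</revision>
```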
This node number isn't really useful unless you're on our servers, since the raw storage tables aren't accessible from outside. But if you want to run statistics over the rest of the metadata fields, these lighter-weight files will be a lot faster to work with than the versions with full text embedded.
These are compressed with gzip for speed; the stub dump for English Wikipedia full history runs about 2 gigabytes compressed.
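As a rough sketch of how you might stream metadata out of a stub dump without loading it into memory, here's a small Python example using the standard library's iterparse. The helper name and the sample XML are mine, and the sample is heavily simplified (real dumps also carry contributor, comment, page ids, and so on):

```python
import io
import xml.etree.ElementTree as ET

def local(tag):
    # Strip the XML namespace, e.g. "{http://.../export-0.3/}text" -> "text"
    return tag.rsplit('}', 1)[-1]

def iter_stub_revisions(fileobj):
    """Stream (page_title, revision_id, text_storage_id) tuples from a stub dump."""
    title = None
    for _event, elem in ET.iterparse(fileobj, events=('end',)):
        tag = local(elem.tag)
        if tag == 'title':
            title = elem.text
        elif tag == 'revision':
            rev_id = text_id = None
            for child in elem:          # direct children only, so we skip contributor's <id>
                ctag = local(child.tag)
                if ctag == 'id':
                    rev_id = child.text
                elif ctag == 'text':
                    text_id = child.attrib.get('id')
            yield (title, rev_id, text_id)
            elem.clear()                # keep memory flat on multi-gigabyte dumps

# A tiny made-up sample in the stub shape:
SAMPLE = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.3/">
  <page>
    <title>Example</title>
    <id>1</id>
    <revision>
      <id>100</id>
      <timestamp>2005-01-01T00:00:00Z</timestamp>
      <text id="12345" />
    </revision>
  </page>
</mediawiki>"""

revs = list(iter_stub_revisions(io.StringIO(SAMPLE)))
print(revs)  # [('Example', '100', '12345')]
```

Against a real dump you'd pass in the decompressed stream instead, e.g. `gzip.open('enwiki-stub-meta-history.xml.gz', 'rt')` (filename illustrative).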
-- brion vibber (brion @ pobox.com)
wikitech-l@lists.wikimedia.org