Historically, there have been several problems with the cur and old SQL
dumps we've provided:
* It's hard to import a dump from MySQL into another DBMS like
PostgreSQL without some filtering.
* Reading the SQL dumps directly into external tools, like Erik Zachte's
statistics scripts, is a pain in the butt.
* Compressed text storage makes it even harder to use the dumps outside
of MediaWiki, or even inside it if PHP is missing the zlib module.
The new schema in MediaWiki 1.5 exacerbates these:
* There's no easy way to get only current revisions from raw table dumps
without downloading a lot of extra junk.
* Some text may not even *be* there -- conversion will add references to
the leftover cur table, and we expect to start storing bulk text outside
the database entirely at some point.
* Deleted text isn't automatically removed from the text table, so a raw
dump is not safe for distribution.
For some time I've planned to replace the cur+old dumps with a dump
using the Special:Export XML stream format[1]. This is a simple XML
wrapper around the page/revision model which reflects the data in our
schema without being tied to the actual table layout itself.
This should be more accessible to external tools: most programming
language environments provide a stream-friendly XML parser and can
easily slurp in and process the data without a lot of hacking.
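As a rough illustration of that kind of stream processing, here is a minimal Python sketch that walks a dump one page at a time using the standard library's incremental parser. The element names (mediawiki, page, title, revision, timestamp, text) are assumptions based on the Special:Export sample markup, and iter_pages is a hypothetical helper, not part of MediaWiki.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip any '{namespace}' prefix so this works with or without xmlns."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream):
    """Yield (title, [revision text, ...]) for each page as it is parsed.

    Assumed layout: <page> holds one <title> and one or more <revision>
    elements, each with a <text> child.
    """
    title, texts = None, []
    for _event, elem in ET.iterparse(stream, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            texts.append(elem.text or "")
        elif name == "page":
            yield title, texts
            title, texts = None, []
            elem.clear()  # release the finished page's subtree


# Tiny hand-written sample standing in for a real dump.
SAMPLE = b"""<mediawiki>
  <page>
    <title>Example</title>
    <revision><timestamp>2005-01-01T00:00:00Z</timestamp><text>first</text></revision>
    <revision><timestamp>2005-01-02T00:00:00Z</timestamp><text>second</text></revision>
  </page>
</mediawiki>"""

pages = list(iter_pages(io.BytesIO(SAMPLE)))
for page_title, page_texts in pages:
    print(page_title, len(page_texts))
```

For a real dump, the io.BytesIO object would be swapped for an open file (or gzip.open on a compressed one); nothing else in the loop needs to change.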
The XML dump is also smaller: compressed blobs are opaque to the SQL
dumps, but here they're expanded and recompressed more efficiently. In
my testing, the gzipped full XML dump of nl.wikipedia.org is about 1/3
smaller than the gzipped SQL cur+old dumps. (A cur-only dump is about
the same size compressed as the SQL version.)
Dumps can be generated with maintenance/dumpBackup.php, which I've just
checked in. It can create both full-history and current-only dumps.
I still need to finish up an importer script using the Special:Import
framework. Also, a standalone .xml.gz-to-database importer might be a
useful tool / demo / exercise.
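To give a feel for what such a standalone importer exercise could look like, here is a small Python sketch that streams a gzipped dump into SQLite. This is not the Special:Import-based importer and not MediaWiki's schema: the two tables, the import_dump helper, and the element names (page, title, revision, timestamp, text) are all assumptions chosen for illustration, and it assumes each page's title appears before its revisions.

```python
import gzip
import os
import sqlite3
import tempfile
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip any '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def import_dump(path, conn):
    """Stream a gzipped XML dump into two illustrative tables.

    Returns the number of revisions inserted. The table layout is made up
    for this sketch and is unrelated to MediaWiki's own tables.
    """
    conn.execute("CREATE TABLE IF NOT EXISTS page (title TEXT PRIMARY KEY)")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS revision (title TEXT, timestamp TEXT, body TEXT)"
    )
    title, count = None, 0
    with gzip.open(path, "rb") as stream:
        for _event, elem in ET.iterparse(stream, events=("end",)):
            name = local_name(elem.tag)
            if name == "title":
                title = elem.text
                conn.execute("INSERT OR IGNORE INTO page VALUES (?)", (title,))
            elif name == "revision":
                # Collect the revision's direct children by local name.
                fields = {local_name(c.tag): c.text or "" for c in elem}
                conn.execute(
                    "INSERT INTO revision VALUES (?, ?, ?)",
                    (title, fields.get("timestamp", ""), fields.get("text", "")),
                )
                count += 1
                elem.clear()  # release the finished revision's subtree
    conn.commit()
    return count


# Tiny hand-written sample standing in for a real dump.
SAMPLE = b"""<mediawiki>
  <page>
    <title>Example</title>
    <revision><timestamp>2005-01-01T00:00:00Z</timestamp><text>first</text></revision>
    <revision><timestamp>2005-01-02T00:00:00Z</timestamp><text>second</text></revision>
  </page>
</mediawiki>"""

with tempfile.TemporaryDirectory() as tmp:
    dump_path = os.path.join(tmp, "sample.xml.gz")
    with gzip.open(dump_path, "wb") as f:
        f.write(SAMPLE)
    conn = sqlite3.connect(":memory:")
    imported = import_dump(dump_path, conn)

rows = conn.execute(
    "SELECT title, timestamp, body FROM revision ORDER BY timestamp"
).fetchall()
```

Because the parser and gzip reader are both streaming, the whole dump never has to fit in memory; a real tool would also want batching, the id numbers, and proper error handling.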
[1] http://meta.wikimedia.org/wiki/Help:Export
Note that final dumps will include page, revision, and user id numbers
which are not reflected in the old sample markup at that page.
-- brion vibber (brion @ pobox.com)