Alfio Puglisi:
So for de.wikipedia the article dump is reduced by a factor of 3, while the complete dump is reduced almost by a factor of 7 (do numbers in [4] refer to the "cur" table or cur+old?).
In MediaWiki 1.5, cur and old are combined. The numbers refer to a gzipped XML dump of the complete page histories. (And the factor is 8.5, not "almost 7". :-)
VERY good. It's important that this new dump format is clearly documented for people who write offline readers, or that a reference implementation exists somewhere. Is
http://meta.wikimedia.org/wiki/User:El/History_compression/Blob_layout
the current split-revisions format?
No, this page describes an internal format. Revisions can be stored in this format, but they don't need to be. In SpecialExport I use the SplitMergeGzipHistoryBlob class only as a temporary container. Users who intend to use the dumps for their programs don't need to know anything about this class because the dumps will be in XML format.
That is the same format that SpecialExport produces for single pages. I only added the elements <sectiongroup> and <section> and changed the meaning of <text> if it has the attribute type="sectionlist".
<text type="sectionlist">0 3 4</text> means, for example, that the text is composed of the 1st, 4th and 5th sections of the previously defined sectiongroup (the indices are zero-based).
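To make the sectionlist idea concrete, here is a rough Python sketch of how a reader might resolve such a <text> element. Only the element names <sectiongroup>, <section> and <text type="sectionlist"> come from the format described above. The nesting inside <page> and <revision>, the sample contents, the helper name resolve_text, and the choice to join sections without any separator are all assumptions made for illustration. The Perl script [1] remains the reference for the real behaviour.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the nesting and the section contents are assumptions,
# only the element and attribute names are taken from the description above.
SAMPLE = """<page>
  <sectiongroup>
    <section>Intro. </section>
    <section>Old part. </section>
    <section>Unused. </section>
    <section>Body. </section>
    <section>End.</section>
  </sectiongroup>
  <revision>
    <text type="sectionlist">0 3 4</text>
  </revision>
</page>"""


def resolve_text(text_elem, sections):
    """Return the full text for a <text> element (hypothetical helper)."""
    if text_elem.get("type") == "sectionlist":
        # The content is a whitespace-separated list of zero-based indices
        # into the previously defined sectiongroup.
        indices = [int(tok) for tok in (text_elem.text or "").split()]
        # Assumption: sections are concatenated with no separator.
        return "".join(sections[i] for i in indices)
    # Without the attribute, <text> holds the revision text itself.
    return text_elem.text or ""


page = ET.fromstring(SAMPLE)
sections = [s.text or "" for s in page.find("sectiongroup").findall("section")]
text = resolve_text(page.find("revision/text"), sections)
print(text)
```

With this sample, the indices 0, 3 and 4 pick the 1st, 4th and 5th sections, and the two sections that are not listed are skipped.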
A reference implementation is the Perl script [1]. You can try it with the example that I've now put on [2].
[1] http://bugzilla.wikipedia.org/attachment.cgi?id=628&action=view
[2] http://meta.wikimedia.org/w/index.php?title=User:El/XML_format