I've uploaded a new version of dumpBackup.php to Bugzilla (#2310 [1]) that understands the following options:
--splitrevisions : Split revisions into sections for better compression. (see [2] and [3])
--usebackrefs : Another optimisation that enhances compression. (... and makes the dump slightly more complicated.) Only meaningful with the --splitrevisions option.
--namespaces=n,m,... : Dump only the given namespaces. (Can be used to dump "encyclopedic" and "non-encyclopedic" content separately.)
--day=yyyymmdd : Dump revisions of that day only. (For daily incremental dumps.) Note that this is pretty useless unless regular dumps of the log table (with page deletions and moves) are made available. An example invocation combining these options is sketched below.
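Something like this (the script location, the namespace choice and the output redirection are just placeholders; any of the existing dumpBackup.php switches would be added in the usual way):

  php maintenance/dumpBackup.php --splitrevisions --usebackrefs --namespaces=0,1 --day=20050622 > dump-20050622.xml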
Dumping with --splitrevisions and --usebackrefs compresses more than 8 times better than without these options, and it is faster. (German Wikipedia; for details see [4]).
In SpecialExport.php I added 4 options ('splitrevisions', 'usebackrefs', 'limit' and 'newerthan'). The rationale for 'limit' and 'newerthan' is described in bug #1748 [5]. Maybe the corresponding GUI elements can be removed because the options are mainly useful for download scripts.
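Purely as an illustration of how a download script might use them (whether the options are passed as URL parameters under exactly these names, and the expected format of the 'newerthan' value, would have to be checked against SpecialExport.php):

  http://de.wikipedia.org/wiki/Special:Export/Wikipedia?splitrevisions=1&usebackrefs=1&limit=100&newerthan=20050601000000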
I also tried to write a new XML schema, but I don't think it is the best solution. I'd be glad if someone could write a better one. (It should specify, e.g., that in <section>abc</section> the content abc is a string, while in <section type="backref">0</section> the content 0 is an integer. I don't know how to express this.)
[1] http://bugzilla.wikipedia.org/show_bug.cgi?id=2310
[2] http://mail.wikipedia.org/pipermail/wikitech-l/2005-June/030001.html
[3] http://mail.wikipedia.org/pipermail/wikitech-l/2005-June/030047.html
[4] http://meta.wikimedia.org/w/index.php?title=User:El/dumpBackup.php
[5] http://bugzilla.wikipedia.org/show_bug.cgi?id=1748
So for de.wikipedia the article dump is reduced by a factor of 3, while the complete dump is reduced almost by a factor of 7 (do the numbers in [4] refer to the "cur" table or to cur+old?).
VERY good. It's important that this new dump format is clearly documented for people who write offline readers, or that a reference implementation exists somewhere. Is
http://meta.wikimedia.org/wiki/User:El/History_compression/Blob_layout
the current split-revisions format?
Alfio
Alfio Puglisi:
So for de.wikipedia the article dump is reduced by a factor of 3, while the complete dump is reduced almost by a factor of 7 (do the numbers in [4] refer to the "cur" table or to cur+old?).
In MediaWiki 1.5 the cur and old tables are combined. The numbers refer to a gzipped XML dump of the complete page histories. (And the factor is 8.5, not "almost 7". :-)
VERY good. It's important that this new dump format is clearly documented for people who write offline readers, or that a reference implementation exists somewhere. Is
http://meta.wikimedia.org/wiki/User:El/History_compression/Blob_layout
the current split-revisions format?
No, this page describes an internal format. Revisions can be stored in this format, but they don't need to be. In SpecialExport I use the SplitMergeGzipHistoryBlob class only as a temporary container. Users who intend to use the dumps for their programs don't need to know anything about this class because the dumps will be in XML format.
That is the same format that SpecialExport produces for single pages. I only added the elements <sectiongroup> and <section> and changed the meaning of <text> if it has the attribute type="sectionlist".
For example, <text type="sectionlist">0 3 4</text> means that the text is composed of the 1st, 4th and 5th sections of the previously defined sectiongroup.
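Roughly, in PHP (untested, just to sketch the idea, and not the reference implementation mentioned below; it assumes that back-references index earlier sections within the same sectiongroup, that a sectionlist refers to the most recently seen sectiongroup, and that sections are concatenated without a separator -- all of which should be checked against the Perl script and the example mentioned below):

  <?php
  // Hypothetical decoder sketch for the split-revisions XML dump format.
  // Element and attribute names follow the examples in this mail; the file
  // name and the back-reference semantics are assumptions.
  $dump = simplexml_load_file('example-dump.xml');

  $sections = array();  // sections of the most recently seen <sectiongroup>

  foreach ($dump->xpath('//sectiongroup | //text') as $node) {
      if ($node->getName() === 'sectiongroup') {
          $sections = array();
          foreach ($node->section as $s) {
              if ((string)$s['type'] === 'backref') {
                  // assumed: the content is the index of an earlier section
                  $sections[] = $sections[(int)(string)$s];
              } else {
                  $sections[] = (string)$s;
              }
          }
      } elseif ((string)$node['type'] === 'sectionlist') {
          // e.g. <text type="sectionlist">0 3 4</text>
          $parts = array();
          foreach (preg_split('/\s+/', trim((string)$node)) as $i) {
              $parts[] = $sections[(int)$i];
          }
          echo implode('', $parts), "\n";  // reconstructed revision text
      } else {
          echo (string)$node, "\n";        // plain revision text
      }
  }

(For real dumps one would of course use a streaming parser instead of SimpleXML, but that doesn't change the reconstruction logic.)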
A reference implementation is the Perl script [1]. You can try it with the example that I've now put at [2].
[1] http://bugzilla.wikipedia.org/attachment.cgi?id=628&action=view
[2] http://meta.wikimedia.org/w/index.php?title=User:El/XML_format