So for de.wikipedia the article dump shrinks by a factor of 3, while the complete dump shrinks by almost a factor of 7. (Do the numbers in [4] refer to the "cur" table or to cur+old?)
VERY good. It's important that this new dump format is clearly documented for people who write offline readers, or that a reference implementation exists somewhere. Is
http://meta.wikimedia.org/wiki/User:El/History_compression/Blob_layout
the current split-revisions format?
Alfio
On Wed, 22 Jun 2005 elwp@gmx.de wrote:
I've uploaded a new version of dumpBackup.php to Bugzilla (#2310 [1]) that understands the following options:
--splitrevisions : Split revisions into sections for better compression. (see [2] and [3])
--usebackrefs : Another optimisation that enhances compression. (... and makes the dump slightly more complicated.) Only meaningful with the --splitrevisions option.
--namespaces=n,m,... : Dump only the given namespaces. (Can be used to dump "encyclopedic" and "non-encyclopedic" content separately.)
--day=yyyymmdd : Dump revisions of that day only. (For daily incremental dumps.) Note that this is pretty useless unless regular dumps of the log table (with page deletions and moves) are made available.
Dumping with --splitrevisions and --usebackrefs compresses more than 8 times better than without these options, and it is faster. (German Wikipedia; for details see [4]).
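To illustrate why splitting revisions and using back-references can help compression, here is a minimal Python sketch. It is not the dumpBackup.php code. It assumes that a revision is split at blank lines and that a backref is the index of an identical section in the immediately preceding revision; both rules, and the function names, are hypothetical. The actual layout is whatever [2] and [3] describe.

```python
def split_sections(text):
    """Split revision text into sections at blank lines (assumed rule)."""
    return text.split("\n\n")


def encode_revisions(revisions):
    """Encode each revision as a list of ("text", str) or ("backref", int).

    A section that also appears in the previous revision is replaced by
    the index of that section there (assumed meaning of a backref).
    """
    encoded = []
    previous = []  # sections of the preceding revision
    for text in revisions:
        sections = split_sections(text)
        out = []
        for section in sections:
            if section in previous:
                out.append(("backref", previous.index(section)))
            else:
                out.append(("text", section))
        encoded.append(out)
        previous = sections
    return encoded


history = ["Intro\n\nBody one", "Intro\n\nBody two"]
encoded = encode_revisions(history)
# The unchanged "Intro" section of the second revision becomes a backref,
# so only the changed section is stored again.
print(encoded[1])
```

Since most edits touch only a small part of a page, most sections of a new revision would turn into short integers under this scheme, which is presumably where the gain comes from.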
In SpecialExport.php I added 4 options ('splitrevisions', 'usebackrefs', 'limit' and 'newerthan'). The rationale for 'limit' and 'newerthan' is described in bug #1748 [5]. Maybe the corresponding GUI elements can be removed because the options are mainly useful for download scripts.
I also tried to write a new XML schema, but I don't think it is the best solution. I'd be glad if someone could write a better one. (It should specify e.g. that in
<section>abc</section> abc is a string and in <section type="backref">0</section> 0 is an integer. I don't know how to do this.)
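Until a schema can express this, a reader could enforce the string/integer distinction itself. The following Python sketch shows one way, using only the two element forms quoted above. The wrapping <revision> element, the function name, and the rule that a backref indexes into the sections of the previous revision are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def read_sections(revision_xml, previous_sections):
    """Return the section strings of one revision, resolving backrefs.

    Plain <section> elements carry a string. <section type="backref">
    must carry an integer, taken here as an index into previous_sections
    (assumed semantics).
    """
    root = ET.fromstring(revision_xml)
    sections = []
    for elem in root.iter("section"):
        content = elem.text or ""
        if elem.get("type") == "backref":
            index = int(content)  # ValueError if the content is not an integer
            sections.append(previous_sections[index])
        else:
            sections.append(content)
    return sections


first = read_sections(
    "<revision><section>abc</section><section>def</section></revision>", []
)
second = read_sections(
    '<revision><section type="backref">0</section>'
    "<section>xyz</section></revision>",
    first,
)
print(second)
```

A backref whose content is not an integer makes int() raise ValueError, so malformed dumps are rejected at read time rather than by the schema.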
[1] http://bugzilla.wikipedia.org/show_bug.cgi?id=2310
[2] http://mail.wikipedia.org/pipermail/wikitech-l/2005-June/030001.html
[3] http://mail.wikipedia.org/pipermail/wikitech-l/2005-June/030047.html
[4] http://meta.wikimedia.org/w/index.php?title=User:El/dumpBackup.php
[5] http://bugzilla.wikipedia.org/show_bug.cgi?id=1748
_______________________________________________
Wikitech-l mailing list
Wikitech-l@wikimedia.org
http://mail.wikipedia.org/mailman/listinfo/wikitech-l