So for de.wikipedia the article dump is reduced by a factor of 3, while
the complete dump is reduced by almost a factor of 7 (do the numbers in
[4] refer to the "cur" table or to cur+old?).
VERY good. It's important that this new dump format is clearly documented
for people who write offline readers, or that a reference implementation
exists somewhere. Is
http://meta.wikimedia.org/wiki/User:El/History_compression/Blob_layout
the current split-revisions format?
Alfio
On Wed, 22 Jun 2005 elwp(a)gmx.de wrote:
I've uploaded a new version of dumpBackup.php to
Bugzilla (#2310 [1]) that understands the following
options:
--splitrevisions :
Split revisions into sections for better
compression. (see [2] and [3])
--usebackrefs :
Another optimisation that enhances compression.
(... and makes the dump slightly more
complicated.) Only meaningful with the
--splitrevisions option.
--namespaces=n,m,... :
Dump only the given namespaces. (Can be used to
dump "encyclopedic" and "non-encyclopedic" content
separately.)
--day=yyyymmdd :
Dump revisions of that day only. (For daily
incremental dumps.) Note that this is pretty
useless unless regular dumps of the log table
(with page deletions and moves) are made
available.
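As a rough illustration of what a --day=yyyymmdd filter has to do, here is a minimal Python sketch. It assumes revisions carry MediaWiki-style 14-digit timestamps (yyyymmddhhmmss); the function names and the dict layout are made up for the example and are not taken from dumpBackup.php.

```python
from datetime import datetime, timedelta

def day_bounds(day):
    """Return (start, end) as 14-digit timestamps; end is exclusive."""
    start = datetime.strptime(day, "%Y%m%d")  # ValueError on malformed input
    end = start + timedelta(days=1)
    fmt = "%Y%m%d%H%M%S"
    return start.strftime(fmt), end.strftime(fmt)

def revisions_of_day(revisions, day):
    """Keep only revisions whose timestamp falls on the given day."""
    lo, hi = day_bounds(day)
    # Fixed-width digit strings compare correctly as plain strings.
    return [r for r in revisions if lo <= r["timestamp"] < hi]
```

Using a half-open range avoids special-casing 23:59:59, and the same bounds could be dropped into a SQL WHERE clause on the timestamp column.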
Dumping with --splitrevisions and --usebackrefs compresses
more than 8 times better than without these options, and it
is faster. (German Wikipedia; for details see [4]).
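To show why splitting revisions and using back references can shrink a dump, here is a small Python sketch of the idea. It is not the actual algorithm of the patch. Two things are assumptions made for the example: the splitting rule (sections separated by blank lines) and the meaning of a back reference (the index of an earlier literal section on the same page).

```python
def split_sections(text):
    # Hypothetical splitting rule: sections are separated by blank lines.
    return text.split("\n\n")

def encode(revisions):
    """Encode revision texts as lists of str (literal) or int (backref)."""
    seen = {}  # section text -> index of its first literal occurrence
    out = []
    for text in revisions:
        encoded = []
        for section in split_sections(text):
            if section in seen:
                encoded.append(seen[section])  # repeat: store the index only
            else:
                seen[section] = len(seen)
                encoded.append(section)        # first occurrence: store text
        out.append(encoded)
    return out

def decode(encoded_revisions):
    """Invert encode(): rebuild the full text of every revision."""
    literals = []  # literal sections in order of first appearance
    texts = []
    for encoded in encoded_revisions:
        sections = []
        for item in encoded:
            if isinstance(item, int):
                sections.append(literals[item])
            else:
                literals.append(item)
                sections.append(item)
        texts.append("\n\n".join(sections))
    return texts
```

For two revisions that differ in one paragraph only, the second revision is stored as one literal section plus small integers, which is where the gain would come from before gzip or bzip2 even runs.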
In SpecialExport.php I added 4 options ('splitrevisions',
'usebackrefs', 'limit' and 'newerthan'). The rationale for
'limit' and 'newerthan' is described in bug #1748 [5]. Maybe
the corresponding GUI elements can be removed because the
options are mainly useful for download scripts.
I also tried to write a new XML schema, but I don't think
it is the best solution. I'd be glad if someone could
write a better one. (It should specify e.g. that in
<section>abc</section> abc is a string and in
<section type="backref">0</section> 0 is an integer. I
don't know how to do this.)
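As far as I know, XML Schema 1.0 cannot make an element's content type depend on an attribute value (conditional type assignment only arrived with XSD 1.1), so a reader may have to enforce this rule in code. A minimal Python sketch of such a check follows; the <revision> wrapper element and the function name are hypothetical, only <section> and type="backref" come from the dump format described here.

```python
import xml.etree.ElementTree as ET

def check_sections(revision_xml):
    """Return a list of error messages; an empty list means no problems."""
    errors = []
    root = ET.fromstring(revision_xml)
    for i, section in enumerate(root.iter("section")):
        text = section.text or ""
        if section.get("type") == "backref":
            # A back reference must hold a non-negative decimal integer.
            if not (text.isascii() and text.isdigit()):
                errors.append(
                    f"section {i}: backref content {text!r} is not an integer"
                )
        # Any other section is free-form text, so there is nothing to check.
    return errors
```

A reference implementation of an offline reader could run a check like this on each revision before it tries to resolve the back references.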
[1] http://bugzilla.wikipedia.org/show_bug.cgi?id=2310
[2] http://mail.wikipedia.org/pipermail/wikitech-l/2005-June/030001.html
[3] http://mail.wikipedia.org/pipermail/wikitech-l/2005-June/030047.html
[4] http://meta.wikimedia.org/w/index.php?title=User:El/dumpBackup.php
[5] http://bugzilla.wikipedia.org/show_bug.cgi?id=1748
_______________________________________________
Wikitech-l mailing list
Wikitech-l(a)wikimedia.org
http://mail.wikipedia.org/mailman/listinfo/wikitech-l