Brion Vibber:
<page>
<section>sectiontext0</section>
<section>sectiontext1</section>
<section>sectiontext2</section>
<revision><text type="sectionlist">0
1</text></revision>
<revision><text type="sectionlist">0
2</text></revision>
</page>
Can you show that this does significantly better than gzip?
I don't know whether this alone does better than gzip. The output
is meant to be compressed with gzip anyway. The point is that gzip
compresses this format much better than it compresses a stream of
complete revision texts.
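
A toy illustration of the effect (a sketch with made-up text, not
real dump data; only the relative sizes matter):

import random, zlib

random.seed(0)
WORDS = ["wiki", "dump", "page", "revision", "section", "text"]

def section(i):
    # ~4 kB of deterministic pseudo-random prose per section
    body = " ".join(random.choice(WORDS) for _ in range(800))
    return "== S%d ==\n%s\n" % (i, body)

# One page of ~40 kB, larger than deflate's 32 kB window.
base = "".join(section(i) for i in range(10))
revisions = [base + "Edit %d\n" % i for i in range(20)]

# Stream of complete revision texts: identical regions of
# consecutive revisions lie ~40 kB apart, beyond the window,
# so deflate cannot reuse them.
naive = zlib.compress("".join(revisions).encode(), 9)

# The section scheme, in spirit: unchanged sections stored once.
edits = "".join("Edit %d\n" % i for i in range(20))
deduped = zlib.compress((base + edits).encode(), 9)

print("complete texts:", len(naive))
print("sections once: ", len(deduped))

The first size should come out far larger, roughly twenty times the
second, because each revision repeats ~40 kB that deflate has
already forgotten.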
I've tested it with the dumps of the German Wikipedia. The
results are here:
http://meta.wikimedia.org/wiki/User:El/History_compression
On average, the total size of the compressed revision texts is
reduced to (not by) 18.5%. Since the complete dumps also include
other information (user, timestamp, ...) that doesn't benefit
from my method, I expect the final sizes to be around 1/4.
The window size of deflate is the main cause of this huge
difference. Its maximum is 32 kB, but many pages - especially
discussion pages - are larger, so matching regions must be
brought closer together. Splitting pages by section and sorting
the sections of several revisions by section heading does exactly
that. (And in addition, unchanged sections need to be stored only
once.)
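
Here is a minimal sketch in Python of that splitting and sorting,
assuming the format from the sample above (simplified: the regex
only recognizes '==' headings, XML escaping is omitted, and sorting
the stored texts lexicographically stands in for sorting by heading,
since each section text begins with its heading):

import re

def split_sections(wikitext):
    # Split before every line starting with '==' (a heading); the
    # chunk before the first heading counts as a section of its own.
    return [s for s in re.split(r"(?m)^(?===)", wikitext) if s]

def emit_page(revision_texts):
    # Store every distinct section text once; each revision becomes
    # a list of indices into that store.
    store, index, rev_lists = [], {}, []
    for text in revision_texts:
        ids = []
        for sec in split_sections(text):
            if sec not in index:
                index[sec] = len(store)
                store.append(sec)
            ids.append(index[sec])
        rev_lists.append(ids)
    # Sort the store so that edited variants of the same section end
    # up next to each other, and remap the indices accordingly.
    order = sorted(range(len(store)), key=store.__getitem__)
    remap = {old: new for new, old in enumerate(order)}
    store = [store[i] for i in order]
    rev_lists = [[remap[i] for i in ids] for ids in rev_lists]
    out = ["<page>"]
    out += ["<section>%s</section>" % s for s in store]
    out += ['<revision><text type="sectionlist">%s</text></revision>'
            % "\n".join(map(str, ids)) for ids in rev_lists]
    out.append("</page>")
    return "\n".join(out)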
Certainly it won't simplify dump processing.
Yes, but it's not very complicated. The program just needs
to keep some sections in memory and concatenate them.
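
For example, a minimal sketch of such a reader in Python (assuming
each <page> is well-formed XML, as in the sample at the top):

import xml.etree.ElementTree as ET

def revision_texts(page_xml):
    # Keep all section texts of the page in memory, then rebuild
    # each revision by concatenating the sections named by its
    # index list.
    page = ET.fromstring(page_xml)
    store = [sec.text or "" for sec in page.findall("section")]
    for rev in page.findall("revision"):
        ids = rev.find("text").text.split()
        yield "".join(store[int(i)] for i in ids)

Applied to the sample at the top, it yields sectiontext0 +
sectiontext1 for the first revision and sectiontext0 + sectiontext2
for the second.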