Brion Vibber:
<page>
  <section>sectiontext0</section>
  <section>sectiontext1</section>
  <section>sectiontext2</section>
  <revision><text type="sectionlist">0 1</text></revision>
  <revision><text type="sectionlist">0 2</text></revision>
</page>
Can you show that this does significantly better than gzip?
I don't know whether this alone does better than gzip; the output is meant to be compressed with gzip anyway. But gzip compresses this format much better than it compresses a stream of complete revision texts.
I've tested it with the dumps of the German Wikipedia. The results are here:
http://meta.wikimedia.org/wiki/User:El/History_compression
On average, the total size of the compressed revision texts can be reduced to (not by) 18.5%. Since the complete dumps also include other information (user, timestamp, ...) that doesn't benefit from my method, I guess the final sizes will be around 1/4 of the current ones.
The window size of the deflate function is the main cause of this huge difference. Its maximum value is 32 kB, but many pages - especially discussion pages - are larger than that, so you have to bring matching regions closer together. Splitting the files by section and sorting the sections of several revisions by section heading does exactly this. (And additionally one can omit unchanged sections.)
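To make the effect concrete, it can be sketched roughly like this in Python (a simplified illustration, not the tool used for the tests; the split on lines starting with "==" and all function names are assumptions):

import gzip

def split_sections(text):
    # Crude split on lines starting with "==" (an assumption; the exact
    # splitting rules don't matter for the comparison).
    sections, current = [], []
    for line in text.splitlines(keepends=True):
        if line.startswith("==") and current:
            sections.append("".join(current))
            current = []
        current.append(line)
    if current:
        sections.append("".join(current))
    return sections

def gz_len(text):
    return len(gzip.compress(text.encode("utf-8")))

def compare(revisions):
    # Baseline: gzip over the complete revision texts, one after another.
    plain = gz_len("".join(revisions))

    # Section layout: keep each distinct section text once (unchanged
    # sections are omitted) and sort by heading, so the versions of one
    # section sit next to each other and their matching regions fall
    # inside deflate's 32 kB window.
    grouped, seen = [], set()
    for rev in revisions:
        for sec in split_sections(rev):
            if sec not in seen:
                seen.add(sec)
                grouped.append(sec)
    grouped.sort(key=lambda s: s.splitlines()[0] if s else "")
    sectioned = gz_len("".join(grouped))

    return plain, sectioned

Because the sort is stable, the versions of one section stay in chronological order, so consecutive versions differ only slightly and deflate can find the matches.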
Certainly it won't simplify dump processing.
Yes, but it's not very complicated. The program just needs to keep some sections in memory and concatenate them.
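For example, a <page> in the format above can be read back with a few lines of Python (a minimal sketch; real dumps contain more elements, and here the sections are simply concatenated as in the example):

import xml.etree.ElementTree as ET

def revision_texts(page_xml):
    # Rebuild each revision's full text from the per-page section pool.
    page = ET.fromstring(page_xml)
    sections = [s.text or "" for s in page.findall("section")]
    texts = []
    for rev in page.findall("revision"):
        indices = rev.find("text").text.split()   # e.g. "0 2"
        texts.append("".join(sections[int(i)] for i in indices))
    return texts

example = """<page>
  <section>sectiontext0</section>
  <section>sectiontext1</section>
  <section>sectiontext2</section>
  <revision><text type="sectionlist">0 1</text></revision>
  <revision><text type="sectionlist">0 2</text></revision>
</page>"""

print(revision_texts(example))
# ['sectiontext0sectiontext1', 'sectiontext0sectiontext2']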