History blobs are not the only thing that can benefit from splitting revision texts into sections and sorting them: the size of XML-exported pages (with complete page histories) can also be reduced.
This is the current structure (only the relevant tags are shown):
<page>
  <revision><text>text0</text></revision>
  <revision><text>text1</text></revision>
</page>
This would be the new structure:
<page>
  <section>sectiontext0</section>
  <section>sectiontext1</section>
  <section>sectiontext2</section>
  <revision><text type="sectionlist">0 1</text></revision>
  <revision><text type="sectionlist">0 2</text></revision>
</page>
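For illustration, here is a rough sketch (in Python, not actual MediaWiki code; the heading-based splitting rule and the helper names are my own) of how an exporter could produce that structure: split each revision at its "== Heading ==" lines, store each distinct section text once, and emit a space-separated index list per revision.

import re
from xml.sax.saxutils import escape

def split_sections(wikitext):
    # Everything before the first "== Heading ==" line is one section,
    # then each heading starts a new one. (Needs Python 3.7+ for
    # re.split on a zero-width match.)
    parts = re.split(r'(?m)^(?===[^=].*==\s*$)', wikitext)
    return [p for p in parts if p]

def export_page(revision_texts):
    sections, index, section_lists = [], {}, []   # distinct texts, text -> id
    for text in revision_texts:
        ids = []
        for sec in split_sections(text):
            if sec not in index:
                index[sec] = len(sections)
                sections.append(sec)
            ids.append(index[sec])
        section_lists.append(ids)

    out = ['<page>']
    out += ['  <section>%s</section>' % escape(s) for s in sections]
    out += ['  <revision><text type="sectionlist">%s</text></revision>'
            % ' '.join(map(str, ids)) for ids in section_lists]
    out.append('</page>')
    return '\n'.join(out)

print(export_page(['intro\n== A ==\nold text\n',
                   'intro\n== A ==\nnew text\n']))

For these two toy revisions the output has three <section> elements and the section lists "0 1" and "0 2", just like the example above.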
Maybe it's not a problem for the Wikimedia servers to store 50 or 100 GB of badly compressed history blobs, but I guess most people who download the dumps would appreciate smaller files. I'll write the code if there is consensus on such a format.
elwp@gmx.de wrote:
History blobs are not the only thing that can benefit from splitting revision texts into sections and sorting them: the size of XML-exported pages (with complete page histories) can also be reduced.
This is the current structure (only the relevant tags are shown):
<page>
  <revision><text>text0</text></revision>
  <revision><text>text1</text></revision>
</page>
This would be the new structure:
<page>
  <section>sectiontext0</section>
  <section>sectiontext1</section>
  <section>sectiontext2</section>
  <revision><text type="sectionlist">0 1</text></revision>
  <revision><text type="sectionlist">0 2</text></revision>
</page>
Can you show that this does significantly better than gzip? Certainly it won't simplify dump processing.
-- brion vibber
Brion Vibber:
<page>
  <section>sectiontext0</section>
  <section>sectiontext1</section>
  <section>sectiontext2</section>
  <revision><text type="sectionlist">0 1</text></revision>
  <revision><text type="sectionlist">0 2</text></revision>
</page>
Can you show that this does significantly better than gzip?
I don't know whether this alone does better than gzip; the output is meant to be compressed with gzip, of course. And gzip compresses this format much better than a stream of complete revision texts.
I've tested it with the dumps of the German Wikipedia. The results are here:
http://meta.wikimedia.org/wiki/User:El/History_compression
On average, the total size of the compressed revision texts can be reduced to (not by) 18.5%. Since the complete dumps also contain other information (user, timestamp, ...) that doesn't benefit from my method, I expect the final sizes to be around 1/4 of the current ones.
The window size of the deflate algorithm is the main cause of this huge difference. Its maximum is 32 kB, but many pages - especially discussion pages - are larger than that, so matching regions must be brought closer together. Splitting the revisions into sections and sorting the sections of several revisions by section heading does exactly that. (Additionally, unchanged sections can be omitted.)
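To see why the window size matters, here is a rough, self-contained illustration (synthetic random text, not dump data; the section sizes and counts are made up): identical sections that sit more than 32 kB apart in the stream cannot be back-referenced by deflate, but grouping the sections of all revisions by heading moves the repetitions into the window.

import zlib, random, string

random.seed(0)
def para(n=6000):
    return ''.join(random.choice(string.ascii_lowercase + ' ') for _ in range(n))

# 10 sections of ~6 kB each: one revision is ~60 kB, well above 32 kB.
base = ['== S%d ==\n%s\n' % (i, para()) for i in range(10)]
revisions = []
for r in range(5):
    rev = list(base)
    rev[r % 10] += 'small edit in revision %d\n' % r   # each revision touches one section
    revisions.append(rev)

# (a) revisions stored one after another, as in the current dumps
in_revision_order = ''.join(''.join(rev) for rev in revisions).encode()
# (b) the same sections grouped by heading across all revisions
sorted_by_section = ''.join(
    ''.join(rev[i] for rev in revisions) for i in range(10)).encode()

print('revision order:   ', len(zlib.compress(in_revision_order, 9)))
print('sorted by section:', len(zlib.compress(sorted_by_section, 9)))

On this toy input the grouped arrangement should come out several times smaller, because the five nearly identical copies of each section end up within a single 32 kB window.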
Certainly it won't simplify dump processing.
Yes, but it's not very complicated: the program just needs to keep the sections of the current page in memory and concatenate them according to each revision's section list.
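A sketch of that reading side (again Python, with the tag names from the proposal above; it assumes a whole <page> fits in memory):

import xml.etree.ElementTree as ET

def revision_texts(page_xml):
    page = ET.fromstring(page_xml)
    # Keep all section texts of this page in memory ...
    sections = [s.text or '' for s in page.findall('section')]
    # ... and concatenate them according to each revision's section list.
    for rev in page.findall('revision'):
        text = rev.find('text')
        if text.get('type') == 'sectionlist':
            yield ''.join(sections[int(i)] for i in (text.text or '').split())
        else:
            yield text.text or ''    # old-style <text> still works unchanged

page_xml = '''<page>
  <section>sectiontext0</section>
  <section>sectiontext1</section>
  <section>sectiontext2</section>
  <revision><text type="sectionlist">0 1</text></revision>
  <revision><text type="sectionlist">0 2</text></revision>
</page>'''
print(list(revision_texts(page_xml)))
# ['sectiontext0sectiontext1', 'sectiontext0sectiontext2']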
elwp@gmx.de (elwp@gmx.de) [050606 17:44]:
The window size of the deflate algorithm is the main cause of this huge difference. Its maximum is 32 kB, but many pages - especially discussion pages - are larger than that, so matching regions must be brought closer together. Splitting the revisions into sections and sorting the sections of several revisions by section heading does exactly that. (Additionally, unchanged sections can be omitted.)
You mean, something closer to keeping diffs rather than each revision in its entirety? Have you tested using diff rather than this custom diff-like thing?
- d.
David Gerard:
You mean, something closer to keeping diffs rather than each revision in its entirety? Have you tested using diff rather than this custom diff-like thing?
Diffs compress well ([1], "Consecutive forward diffs"), and they are probably fast as well if you want to access all revisions consecutively. But I developed the proposed algorithm primarily for compressing history blobs [2], and that would not be possible with diffs: if someone wants the 20th revision in a blob, the server would have to apply 19 patches, which would be very slow.
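To make the access-cost point concrete, here is a toy comparison (my own illustration, not the HistoryBlob code; the diff representation is a simple opcode list built with Python's difflib):

import difflib

revisions = ['shared intro\nline changed in revision %d\nshared footer\n' % i
             for i in range(20)]

def make_diff(old, new):
    # Store positions into the old text plus the replacement/inserted text.
    sm = difflib.SequenceMatcher(None, old, new)
    return [(tag, i1, i2, new[j1:j2]) for tag, i1, i2, j1, j2 in sm.get_opcodes()]

def apply_diff(old, ops):
    return ''.join(old[i1:i2] if tag == 'equal' else repl
                   for tag, i1, i2, repl in ops)

diffs = [make_diff(revisions[i], revisions[i + 1]) for i in range(19)]

def revision_from_diffs(n):
    text = revisions[0]
    for ops in diffs[:n]:          # n sequential patch applications
        text = apply_diff(text, ops)
    return text

assert revision_from_diffs(19) == revisions[19]
# With section lists, revision 19 would instead be a single
# ''.join(sections[i] for i in section_lists[19]) -- no patch chain.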
I don't know whether my SplitMergeGzipHistoryBlobs will be used some day, but if it is, generating the dumps would certainly be faster if they used the same logical structure.
[1] http://meta.wikimedia.org/wiki/History_compression
[2] http://bugzilla.wikipedia.org/show_bug.cgi?id=2310
elwp@gmx.de wrote:
On average, the total size of the compressed revision texts can be reduced to (not by) 18.5%. Since the complete dumps also contain other information (user, timestamp, ...) that doesn't benefit from my method, I expect the final sizes to be around 1/4 of the current ones.
Spiffy.
-- brion vibber (brion @ pobox.com)