Brion Vibber:
<page>
  <section>sectiontext0</section>
  <section>sectiontext1</section>
  <section>sectiontext2</section>
  <revision><text type="sectionlist">0 1</text></revision>
  <revision><text type="sectionlist">0 2</text></revision>
</page>
Can you show that this does significantly better than gzip?
I don't know whether this alone does better than gzip; the output is meant to be compressed with gzip anyway. But gzip compresses this format much better than it compresses a stream of complete revision texts.
I've tested it with the dumps of the German Wikipedia. The results are here:
http://meta.wikimedia.org/wiki/User:El/History_compression
On average, the total size of the compressed revision texts can be reduced to (not by) 18.5%. Since the complete dumps also include other information (user, timestamp, ...) that doesn't benefit from my method, I guess the final sizes will be around 1/4 of the current ones.
The window size of the deflate function is the main cause of this huge difference. Its maximum value is 32 kB, but many pages - especially discussion pages - are larger than that, so you have to bring matching regions closer together. Splitting the files by section and sorting the sections of several revisions by section heading does exactly this. (And additionally one can omit unchanged sections.)
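To make the effect concrete, it can be sketched roughly like this in Python (a simplified illustration, not the tool used for the tests; the split on lines starting with "==" and all function names are assumptions):

import gzip

def split_sections(text):
    # Crude split on lines starting with "==" (an assumption; the exact
    # splitting rules don't matter for the comparison).
    sections, current = [], []
    for line in text.splitlines(keepends=True):
        if line.startswith("==") and current:
            sections.append("".join(current))
            current = []
        current.append(line)
    if current:
        sections.append("".join(current))
    return sections

def gz_len(text):
    return len(gzip.compress(text.encode("utf-8")))

def compare(revisions):
    # Baseline: gzip over the complete revision texts, one after another.
    plain = gz_len("".join(revisions))

    # Section layout: keep each distinct section text once (unchanged
    # sections are omitted) and sort by heading, so the versions of one
    # section sit next to each other and their matching regions fall
    # inside deflate's 32 kB window.
    grouped, seen = [], set()
    for rev in revisions:
        for sec in split_sections(rev):
            if sec not in seen:
                seen.add(sec)
                grouped.append(sec)
    grouped.sort(key=lambda s: s.splitlines()[0] if s else "")
    sectioned = gz_len("".join(grouped))

    return plain, sectioned

Because the sort is stable, the versions of one section stay in chronological order, so consecutive versions differ only slightly and deflate can find the matches.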
Certainly it won't simplify dump processing.
Yes, but it's not very complicated. The program just needs to keep some sections in memory and concatenate them.
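For example, a <page> in the format above can be read back with a few lines of Python (a minimal sketch; real dumps contain more elements, and here the sections are simply concatenated as in the example):

import xml.etree.ElementTree as ET

def revision_texts(page_xml):
    # Rebuild each revision's full text from the per-page section pool.
    page = ET.fromstring(page_xml)
    sections = [s.text or "" for s in page.findall("section")]
    texts = []
    for rev in page.findall("revision"):
        indices = rev.find("text").text.split()   # e.g. "0 2"
        texts.append("".join(sections[int(i)] for i in indices))
    return texts

example = """<page>
  <section>sectiontext0</section>
  <section>sectiontext1</section>
  <section>sectiontext2</section>
  <revision><text type="sectionlist">0 1</text></revision>
  <revision><text type="sectionlist">0 2</text></revision>
</page>"""

print(revision_texts(example))
# ['sectiontext0sectiontext1', 'sectiontext0sectiontext2']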