Brion Vibber:
<page>
<section>sectiontext0</section>
<section>sectiontext1</section>
<section>sectiontext2</section>
<revision><text type="sectionlist">0
1</text></revision>
<revision><text type="sectionlist">0
2</text></revision>
</page>
Can you show that this does significantly better than gzip?
I don't know whether this alone does better than gzip. The output
is meant to be compressed with gzip anyway. The point is that gzip
compresses this format much better than it compresses a stream of
complete revision texts.
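
A toy illustration of the effect (a sketch with made-up text, not
real dump data; only the relative sizes matter):

import random, zlib

random.seed(0)
WORDS = ["wiki", "dump", "page", "revision", "section", "text"]

def section(i):
    # ~4 kB of deterministic pseudo-random prose per section
    body = " ".join(random.choice(WORDS) for _ in range(800))
    return "== S%d ==\n%s\n" % (i, body)

# One page of ~40 kB, larger than deflate's 32 kB window.
base = "".join(section(i) for i in range(10))
revisions = [base + "Edit %d\n" % i for i in range(20)]

# Stream of complete revision texts: identical regions of
# consecutive revisions lie ~40 kB apart, beyond the window,
# so deflate cannot reuse them.
naive = zlib.compress("".join(revisions).encode(), 9)

# The section scheme, in spirit: unchanged sections stored once.
edits = "".join("Edit %d\n" % i for i in range(20))
deduped = zlib.compress((base + edits).encode(), 9)

print("complete texts:", len(naive))
print("sections once: ", len(deduped))

The first size should come out far larger, roughly twenty times the
second, because each revision repeats ~40 kB that deflate has
already forgotten.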
I've tested it with the dumps of the German Wikipedia. The
results are here:
http://meta.wikimedia.org/wiki/User:El/History_compression
On average, the total size of the compressed revision texts is
reduced to (not by) 18.5%. Since the complete dumps also include
other information (user, timestamp, ...) that doesn't benefit
from my method, I expect the final sizes to be around 1/4.
The window size of deflate is the main cause of this huge
difference. Its maximum is 32 kB, but many pages - especially
discussion pages - are larger, so matching regions must be
brought closer together. Splitting pages by section and sorting
the sections of several revisions by section heading does exactly
that. (And in addition, unchanged sections need to be stored only
once.)
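
Here is a minimal sketch in Python of that splitting and sorting,
assuming the format from the sample above (simplified: the regex
only recognizes '==' headings, XML escaping is omitted, and sorting
the stored texts lexicographically stands in for sorting by heading,
since each section text begins with its heading):

import re

def split_sections(wikitext):
    # Split before every line starting with '==' (a heading); the
    # chunk before the first heading counts as a section of its own.
    return [s for s in re.split(r"(?m)^(?===)", wikitext) if s]

def emit_page(revision_texts):
    # Store every distinct section text once; each revision becomes
    # a list of indices into that store.
    store, index, rev_lists = [], {}, []
    for text in revision_texts:
        ids = []
        for sec in split_sections(text):
            if sec not in index:
                index[sec] = len(store)
                store.append(sec)
            ids.append(index[sec])
        rev_lists.append(ids)
    # Sort the store so that edited variants of the same section end
    # up next to each other, and remap the indices accordingly.
    order = sorted(range(len(store)), key=store.__getitem__)
    remap = {old: new for new, old in enumerate(order)}
    store = [store[i] for i in order]
    rev_lists = [[remap[i] for i in ids] for ids in rev_lists]
    out = ["<page>"]
    out += ["<section>%s</section>" % s for s in store]
    out += ['<revision><text type="sectionlist">%s</text></revision>'
            % "\n".join(map(str, ids)) for ids in rev_lists]
    out.append("</page>")
    return "\n".join(out)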
Certainly it won't simplify dump processing.
Yes, but it's not very complicated. The program just needs
to keep some sections in memory and concatenate them.
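
For example, a minimal sketch of such a reader in Python (assuming
each <page> is well-formed XML, as in the sample at the top):

import xml.etree.ElementTree as ET

def revision_texts(page_xml):
    # Keep all section texts of the page in memory, then rebuild
    # each revision by concatenating the sections named by its
    # index list.
    page = ET.fromstring(page_xml)
    store = [sec.text or "" for sec in page.findall("section")]
    for rev in page.findall("revision"):
        ids = rev.find("text").text.split()
        yield "".join(store[int(i)] for i in ids)

Applied to the sample at the top, it yields sectiontext0 +
sectiontext1 for the first revision and sectiontext0 + sectiontext2
for the second.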