Re: [Wikitech-l] Changing XML Wikipedia Schema to Enable Smaller Incremental Dumps that are Hadoop ready

24 Aug 2011

On Thu, Aug 18, 2011 at 10:30 AM, Diederik van Liere <dvanliere(a)gmail.com> wrote:

...
  1. Denormalization of the schema
 Instead of having a <page> tag with multiple <revision> tags, I
 propose to just have <revision> tags. Each <revision> tag would
 include a <page_id>, <page_title>, <page_namespace> and
 <page_redirect> tag. This denormalization would make it much easier to
 build an incremental dump utility. You only need to keep track of the
 final revision of each article at the moment of dump creation and then
 you can create a new incremental dump continuing from the last dump.
 It would also be easier to restore a dump process that crashed.

Page title/namespace and redirect-ness are not fixed to a revision, and may
change over time. This means that simply knowing the last revision you left
off at doesn't give you enough information for a continuation point; you'd
have to go back and see whether any revisions have been deleted, or whether
their pages' title, redirect-ness, or other properties have changed.
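
To make that concrete, here is a minimal Python sketch of what a continuation
check would have to compare. The names (PageMeta, changes_since) and the
in-memory dict/set inputs are hypothetical, chosen only for illustration; a
real dump tool would read this state from the database or from the previous
dump.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PageMeta:
    # Hypothetical per-page state that is NOT tied to any single revision.
    title: str
    namespace: int
    is_redirect: bool


def changes_since(old_pages, new_pages, old_rev_ids, new_rev_ids, last_rev_id):
    """Return (new_revisions, deleted_revisions, changed_pages).

    old_pages / new_pages: dict mapping page_id -> PageMeta
    old_rev_ids / new_rev_ids: sets of revision ids present in each state
    last_rev_id: highest revision id contained in the previous dump
    """
    # The part that "remember the last revision" does cover.
    new_revisions = sorted(r for r in new_rev_ids if r > last_rev_id)
    # Revisions that were in the previous dump but are gone now.
    deleted_revisions = sorted(r for r in old_rev_ids if r not in new_rev_ids)
    # Pages whose title, namespace, or redirect flag differs from before.
    changed_pages = sorted(
        pid for pid, meta in new_pages.items()
        if pid in old_pages and old_pages[pid] != meta
    )
    return new_revisions, deleted_revisions, changed_pages
```

Only the first of the three results can be derived from the last revision id
alone; the other two need the older state to compare against.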

I think it may be better to abandon the "single XML stream" data model and
allow for structure and random-access. A directory tree with separate files
for various pages/revisions may be a lot easier to produce & update
"in-place", and could be downloaded & resynced with standard tools like
rsync or a custom tool that optimizes what files it looks for.
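
As a rough illustration of such a layout (a sketch only; the sharding scheme,
file naming, and function names here are my own assumptions, not an existing
dump format), each revision could live in its own file under a per-page
directory:

```python
from pathlib import Path


def revision_path(root, page_id, rev_id):
    """Map a (page_id, rev_id) pair to a file path under `root`.

    Pages are sharded by page_id modulo 1000 (an arbitrary, hypothetical
    choice) so that no single directory grows too large.
    """
    shard = f"{page_id % 1000:03d}"
    return Path(root) / shard / str(page_id) / f"{rev_id}.txt"


def write_revision(root, page_id, rev_id, text):
    """Write one revision's text; return False if it was already present.

    Existing files are left untouched, so re-running an update does not
    rewrite data that a sync tool has already transferred.
    """
    path = revision_path(root, page_id, rev_id)
    path.parent.mkdir(parents=True, exist_ok=True)
    if path.exists():
        return False
    path.write_text(text, encoding="utf-8")
    return True
```

With a tree like this, adding new revisions only creates new files, and a
file-level sync tool only has to transfer the files that are new or changed.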

There are basically a few different problems to solve:

1) Building a complete data set and getting that out to people

2) Updating an existing data set with new data

3) Processing a data set in some useful way

Generating the initial dump today is super expensive -- because it's a
single compressed XML stream, we have to copy and re-copy most of the same
data over, and over, and over.

And today there's no good way to just "apply" an incremental dump on top of
your existing download.

3. Smaller dump sizes
...
 The dump files continue to grow as the text of each revision is stored
 in the XML file. Currently, the uncompressed XML dump files of the
 English Wikipedia are about 5.5Tb in size and this will only continue
 to grow. An alternative would be to replace the <text> tag with
 <text_added> and <text_removed> tags. A page can still be
 reconstructed by patching multiple <text_added> and <text_removed>
 tags. We can provide a simple script / tool that would reconstruct the
 full text of an article up to a particular date / revision id. This
 has two advantages:
 1) The dump files will be significantly smaller
 2) It will be easier and faster to analyze the types of edits. Who is
 adding a template, who is wikifying an edit, who is fixing spelling
 and grammar mistakes.

Broadly speaking, some sort of diff storage makes a lot of sense, especially
if it doesn't require reproducing those diffs all the time. :)

But be warned that there are different needs and different ways of
processing data; diffs again interfere with random access, as you need to be
able to fetch adjacent items to reproduce the text. If you're just trundling
along through the entire dump and applying diffs as you go to reconstruct
the text, then you're basically doing what you already do when doing
on-the-fly decompression of the .xml.bz2 or .xml.7z -- it may, or may not,
actually save you anything for this case.
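
The random-access cost can be sketched in a few lines of Python. This uses
difflib from the standard library as a stand-in diff algorithm, and the
copy/insert delta encoding is an assumption made for illustration, not the
proposed <text_added>/<text_removed> format:

```python
import difflib


def make_delta(old, new):
    """Encode `new` relative to `old` as a list of copy/insert operations."""
    ops = []
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Unchanged span: refer back to the old text by offsets.
            ops.append(("copy", i1, i2))
        elif tag in ("replace", "insert"):
            # New text has to be stored literally.
            ops.append(("insert", new[j1:j2]))
        # "delete": nothing to store; the old span is simply not copied.
    return ops


def apply_delta(old, ops):
    """Rebuild the newer text from the older text plus its delta."""
    parts = []
    for op in ops:
        if op[0] == "copy":
            parts.append(old[op[1]:op[2]])
        else:
            parts.append(op[1])
    return "".join(parts)


def reconstruct(base_text, deltas, index):
    """Rebuild revision `index` (0 is the base) by replaying deltas in order."""
    text = base_text
    for ops in deltas[:index]:
        text = apply_delta(text, ops)
    return text
```

Note that reconstruct() has to replay every delta between the base and the
requested revision; there is no way to jump straight to revision N without
the items before it.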

Of course if all you really wanted was the diff, then obviously that's going
to help you. :)

-- brion
