Re: [Wikitech-l] Changing XML Wikipedia Schema to Enable Smaller Incremental Dumps that are Hadoop ready

19 Aug 2011

Sounds all very reasonable.
Some thoughts:
* Having revisions not wrapped into <page> means that for
reconstructing the history of a page, the entire dump has to be
scanned, unless there is an index of all revisions
* Such an index should probably accompany the XML file, ideally with the
XML stored in a seekable compressed container (bgzip etc.)
* I suggest that the current article version at the time of dump is
stored in full, and not as a diff; if you want to do history, you'll
probably calculate all diffs anyway, but the current version should be
accessible right away
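
To illustrate the index idea above, here is a minimal sketch of a revision index for a flat dump. It assumes an uncompressed file with one tag per line and a <page_id> child in every <revision>; the function names and the sample data are hypothetical, and a bgzip container would need block-aware offsets instead of plain byte offsets.

```python
import io
import re
from collections import defaultdict

PAGE_ID = re.compile(rb"<page_id>(\d+)</page_id>")


def build_index(stream):
    """Map page_id -> byte offsets of that page's <revision> records."""
    index = defaultdict(list)
    start = None
    while True:
        offset = stream.tell()
        line = stream.readline()
        if not line:
            break
        if b"<revision>" in line:
            start = offset  # remember where this record begins
        match = PAGE_ID.search(line)
        if match and start is not None:
            index[int(match.group(1))].append(start)
            start = None
    return dict(index)


def read_revision(stream, offset):
    """Seek to one indexed record and return its raw bytes."""
    stream.seek(offset)
    chunk = []
    for line in iter(stream.readline, b""):
        chunk.append(line)
        if b"</revision>" in line:
            break
    return b"".join(chunk)


# Usage on an in-memory sample; a real dump would be opened with open(path, "rb").
SAMPLE = (
    b"<dump>\n"
    b"<revision>\n<id>101</id>\n<page_id>12</page_id>\n</revision>\n"
    b"<revision>\n<id>102</id>\n<page_id>25</page_id>\n</revision>\n"
    b"<revision>\n<id>103</id>\n<page_id>12</page_id>\n</revision>\n"
    b"</dump>\n"
)
sample_index = build_index(io.BytesIO(SAMPLE))
```

With such an index, the history of one page can be read by seeking to its offsets instead of scanning the whole file.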
Magnus
On Thu, Aug 18, 2011 at 6:30 PM, Diederik van Liere dvanliere@gmail.com wrote:
...
Hi!
Over the last year, I have been using the Wikipedia XML dumps
extensively. I used them to conduct the Editor Trends Study [0], and the
Summer Research Fellows [1] and I have used them over the last three
months during the Summer of Research. I am proposing some changes to
the current XML schema based on those experiences.
The current XML schema presents a number of challenges for both the
people who create the dump files and the people who consume them.
Challenges include:

1) The embedded structure of the schema, a single <page> tag with
multiple <revision> tags, makes it very hard to develop an incremental
dump utility.
2) A lot of post processing is required.
3) By storing the entire text for each revision, the dump files are
getting so large that they become unmanageable for most people.

Denormalization of the schema

Instead of having a <page> tag with multiple <revision> tags, I
propose to just have <revision> tags. Each <revision> tag would
include a <page_id>, <page_title>, <page_namespace> and
<page_redirect> tag. This denormalization would make it much easier to
build an incremental dump utility. You only need to keep track of the
final revision of each article at the moment of dump creation and then
you can create a new incremental dump continuing from the last dump.
It would also be easier to restore a dump process that crashed. Finally,
tools like Hadoop would have a much easier time handling this XML
schema than the current one.
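
As an illustration only, a denormalized record and a streaming reader might look like the sketch below. The <page_id>, <page_title>, <page_namespace> and <page_redirect> tags come from the proposal; the <dump> wrapper, the <id> and <timestamp> children, the sample values, and the function names are assumptions.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample of the proposed flat layout (not a real dump excerpt).
SAMPLE = b"""<dump>
  <revision>
    <id>101</id>
    <timestamp>2011-08-01T10:00:00Z</timestamp>
    <page_id>12</page_id>
    <page_title>Anarchism</page_title>
    <page_namespace>0</page_namespace>
    <page_redirect></page_redirect>
  </revision>
  <revision>
    <id>102</id>
    <timestamp>2011-08-02T09:30:00Z</timestamp>
    <page_id>12</page_id>
    <page_title>Anarchism</page_title>
    <page_namespace>0</page_namespace>
    <page_redirect></page_redirect>
  </revision>
  <revision>
    <id>103</id>
    <timestamp>2011-08-03T14:15:00Z</timestamp>
    <page_id>25</page_id>
    <page_title>Anarchy</page_title>
    <page_namespace>0</page_namespace>
    <page_redirect>12</page_redirect>
  </revision>
</dump>"""


def iter_revisions(stream):
    """Yield each <revision> as a flat dict without loading the whole file."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == "revision":
            yield {child.tag: (child.text or "") for child in elem}
            elem.clear()  # free memory held by the finished record


def newer_than(stream, last_rev_id):
    """Select the revisions an incremental dump would need to emit."""
    return [r for r in iter_revisions(stream) if int(r["id"]) > last_rev_id]


# Everything after revision 101 would go into the next incremental dump.
new_revisions = newer_than(io.BytesIO(SAMPLE), last_rev_id=101)
```

Because every record is self-contained, a file in this shape can be split at any <revision> boundary, which is what makes it convenient for tools that process input in independent chunks.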

Post-processing of data

Currently, a significant amount of time is required for
post-processing the data. Some examples include:

1) The title includes the namespace, so excluding pages from a
particular namespace requires generating a separate namespace
variable. Focusing on the main namespace is particularly tricky
because that can only be done by checking that a page does not
belong to any other namespace (see bug
https://bugzilla.wikimedia.org/show_bug.cgi?id=27775).

2) The <redirect> tag currently is either True or False; more useful
would be the article_id of the page to which a page is redirected.

3) Revisions within a <page> are sorted by revision_id, but they should
be sorted by timestamp. The current ordering makes it even harder to
generate diffs between two revisions (see bug
https://bugzilla.wikimedia.org/show_bug.cgi?id=27112).

4) Some useful variables in the MySQL database are not yet exposed in
the XML files. Examples include:
       - Length of revision (part of MediaWiki 1.17)
       - Namespace of article
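
The kind of post-processing described in the first and third points can be sketched as follows. This is only an illustration: the namespace set is a small hand-written subset, the dict-based revision records are hypothetical, and the timestamp format is assumed to be the ISO 8601 form used in the dumps.

```python
from datetime import datetime

# Assumption: a hand-picked subset of English Wikipedia namespace prefixes.
KNOWN_NAMESPACES = {
    "Talk", "User", "User talk", "Wikipedia",
    "File", "Template", "Category", "Help",
}


def split_title(full_title):
    """Return (namespace, title); the main namespace is the empty string."""
    prefix, sep, rest = full_title.partition(":")
    if sep and prefix in KNOWN_NAMESPACES:
        return prefix, rest
    # No known prefix: the page is in the main namespace by elimination.
    return "", full_title


def sort_by_timestamp(revisions):
    """Order revision dicts chronologically instead of by revision id."""
    return sorted(
        revisions,
        key=lambda r: datetime.strptime(r["timestamp"], "%Y-%m-%dT%H:%M:%SZ"),
    )
```

Note that a title such as "Star Wars: Episode I" contains a colon but belongs to the main namespace, which is why the check has to be made against the full list of namespaces rather than by looking for a colon.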

Smaller dump sizes

The dump files continue to grow as the text of each revision is stored
in the XML file. Currently, the uncompressed XML dump files of the
English Wikipedia are about 5.5 TB in size and this will only continue
to grow. An alternative would be to replace the <text> tag with
<text_added> and <text_removed> tags. A page can still be
reconstructed by patching multiple <text_added> and <text_removed>
tags. We can provide a simple script / tool that would reconstruct the
full text of an article up to a particular date / revision id. This
has two advantages:

1) The dump files will be significantly smaller.
2) It will be easier and faster to analyze the types of edits: who is
adding a template, who is wikifying an edit, who is fixing spelling
and grammar mistakes.
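
A reconstruction script of the kind mentioned above could work roughly like the sketch below. The proposal does not specify how <text_added> and <text_removed> would record positions, so this assumes a hypothetical line-based delta of (start, end, added_lines) operations computed with Python's difflib; the removed text is implied by the start and end positions in the previous revision.

```python
import difflib


def make_delta(old, new):
    """Describe how to turn `old` into `new` as (start, end, added_lines) ops."""
    old_lines = old.splitlines(keepends=True)
    new_lines = new.splitlines(keepends=True)
    matcher = difflib.SequenceMatcher(a=old_lines, b=new_lines, autojunk=False)
    return [
        (i1, i2, new_lines[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


def apply_delta(old, delta):
    """Patch one revision forward; ops are applied back to front so that
    earlier line positions stay valid."""
    lines = old.splitlines(keepends=True)
    for start, end, added in reversed(delta):
        lines[start:end] = added
    return "".join(lines)


def reconstruct(first_text, deltas):
    """Rebuild the text after applying every delta in order."""
    text = first_text
    for delta in deltas:
        text = apply_delta(text, delta)
    return text
```

Reconstructing the text at a particular revision then means applying the deltas up to that revision, for example reconstruct(first_text, deltas[:n]).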

Downsides

This suggestion is obviously not backwards compatible and it might
break some tools out there. I think that the upsides (incremental
backups, Hadoop-ready and smaller sizes) outweigh the downside of
being backwards incompatible. The current way of dump generation
cannot continue forever.
[0] http://strategy.wikimedia.org/wiki/Editor_Trends_Study,
http://strategy.wikimedia.org/wiki/March_2011_Update
[1] http://blog.wikimedia.org/2011/06/01/summerofresearchannouncement/
I would love to hear your thoughts and comments!
Best,
Diederik

Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
