Hi!
Over the last year, I have been using the Wikipedia XML dumps extensively. I used them to conduct the Editor Trends Study [0], and the Summer Research Fellows [1] and I have used them over the last three months during the Summer of Research. Based on those experiences, I am proposing some changes to the current XML schema.
The current XML schema presents a number of challenges, both for the people who create the dump files and for the people who consume them. Challenges include:
1) The nested structure of the schema (a single <page> tag containing multiple <revision> tags) makes it very hard to develop an incremental dump utility.
2) A lot of post-processing is required.
3) Because the entire text of each revision is stored, the dump files have become so large that they are unmanageable for most people.
1. Denormalization of the schema

Instead of having a <page> tag with multiple <revision> tags, I propose to have only <revision> tags. Each <revision> tag would include a <page_id>, <page_title>, <page_namespace> and <page_redirect> tag. This denormalization would make it much easier to build an incremental dump utility: you only need to keep track of the final revision of each article at the moment of dump creation, and then you can create a new incremental dump continuing from the last one. It would also be easier to restore a dump process that crashed. Finally, tools like Hadoop would have a much easier time handling this schema than the current one.
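To make this concrete, here is a minimal sketch of how such a flat dump could be consumed as a stream. The page metadata tag names follow the proposal above; the file name and the <id>/<timestamp> child tags are just assumptions for illustration.

    # Sketch only: assumes a flat dump in which <revision> elements carry
    # their own page metadata, as proposed above.
    import xml.etree.ElementTree as ET

    def iter_revisions(path):
        # Stream revisions one at a time without loading the whole dump.
        for event, elem in ET.iterparse(path, events=("end",)):
            if elem.tag != "revision":
                continue
            yield {
                "page_id": int(elem.findtext("page_id")),
                "page_title": elem.findtext("page_title"),
                "page_namespace": int(elem.findtext("page_namespace")),
                "page_redirect": elem.findtext("page_redirect"),
                "revision_id": int(elem.findtext("id")),
                "timestamp": elem.findtext("timestamp"),
            }
            elem.clear()  # keep memory usage flat while streaming

    # An incremental run would only need to remember the newest revision id
    # (or timestamp) from the previous dump and continue from there.
    last_seen = max(r["revision_id"] for r in iter_revisions("enwiki-flat.xml"))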
2. Post-processing of data

Currently, a significant amount of time is required for post-processing the data. Some examples include:
* The title includes the namespace, so excluding pages from a particular namespace requires deriving a separate namespace variable. Focusing on the main namespace is particularly tricky, because that can only be done by checking that a page does not belong to any other namespace (see bug https://bugzilla.wikimedia.org/show_bug.cgi?id=27775); a rough illustration follows after this list.
* The <redirect> tag is currently either True or False; more useful would be the article_id of the page to which a page redirects.
* Revisions within a <page> are sorted by revision_id, but they should be sorted by timestamp. The current ordering makes it even harder to generate diffs between two revisions (see bug https://bugzilla.wikimedia.org/show_bug.cgi?id=27112).
* Some useful variables in the MySQL database are not yet exposed in the XML files. Examples include:
  - Length of revision (part of MediaWiki 1.17)
  - Namespace of article
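To illustrate the first point, this is roughly the workaround consumers have to write today to find main-namespace pages. The prefix list is abbreviated and would really have to be read from the <siteinfo>/<namespaces> section of the dump.

    # Rough illustration of the current post-processing burden: the main
    # namespace can only be detected by the absence of any known prefix.
    # The prefix list below is abbreviated; in practice it must be taken
    # from the dump's <siteinfo>/<namespaces> section.
    KNOWN_PREFIXES = ("Talk:", "User:", "User talk:", "Wikipedia:",
                      "Wikipedia talk:", "File:", "Template:", "Category:",
                      "Help:", "Portal:")

    def is_main_namespace(title):
        return not title.startswith(KNOWN_PREFIXES)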
3. Smaller dump sizes

The dump files continue to grow because the full text of each revision is stored in the XML file. Currently, the uncompressed XML dump files of the English Wikipedia are about 5.5 TB in size, and this will only continue to grow. An alternative would be to replace the <text> tag with <text_added> and <text_removed> tags. A revision's full text can still be reconstructed by applying the chain of <text_added> and <text_removed> patches, and we can provide a simple script / tool that reconstructs the full text of an article up to a particular date or revision id (a rough sketch of such a delta follows below). This has two advantages:
1) The dump files will be significantly smaller.
2) It will be easier and faster to analyze the types of edits: who is adding a template, who is wikifying an edit, who is fixing spelling and grammar mistakes.
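Reconstruction needs positions as well as the added and removed text itself, so in practice the per-revision delta would probably look something like the following sketch. The line-based, positional opcodes here are an assumption for illustration, not part of the proposal.

    import difflib

    def make_delta(old, new):
        # Line-based sketch of a <text_added>/<text_removed> style delta:
        # each operation records where in the previous revision it applies.
        old_lines = old.splitlines(True)
        new_lines = new.splitlines(True)
        matcher = difflib.SequenceMatcher(None, old_lines, new_lines)
        return [(tag, i1, i2, new_lines[j1:j2])
                for tag, i1, i2, j1, j2 in matcher.get_opcodes()
                if tag != "equal"]

    def apply_delta(old, delta):
        # Patch the previous revision's text to obtain the next one.
        old_lines = old.splitlines(True)
        out, pos = [], 0
        for tag, i1, i2, added in delta:
            out.extend(old_lines[pos:i1])  # copy the unchanged part
            out.extend(added)              # insert / replace with new lines
            pos = i2                       # skip the removed lines
        out.extend(old_lines[pos:])
        return "".join(out)

Reconstructing a revision is then a matter of folding apply_delta over all deltas of a page up to the desired date or revision id.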
4. Downsides

This suggestion is obviously not backwards compatible and it might break some tools out there. I think the upsides (incremental dumps, Hadoop-readiness and smaller sizes) outweigh the downside of being backwards incompatible. The current way of generating dumps cannot continue forever.
[0] http://strategy.wikimedia.org/wiki/Editor_Trends_Study, http://strategy.wikimedia.org/wiki/March_2011_Update
[1] http://blog.wikimedia.org/2011/06/01/summerofresearchannouncement/
I would love to hear your thoughts and comments!
Best, Diederik
Sounds all very reasonable.
Some thoughts:
* Having revisions not wrapped into <page> means that for reconstructing the history of a page, the entire dump has to be scanned, unless there is an index of all revisions.
* Such an index should probably accompany the XML file, ideally with the XML in a seekable zip container (bgzip etc.); a rough sketch of such an index follows below.
* I suggest that the current article version at the time of the dump is stored in full, and not as a diff; if you want to do history, you'll probably calculate all the diffs anyway, but the current version should be accessible right away.
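For what it's worth, the index could be as simple as a table of per-page byte offsets built in a single pass. Here is a rough sketch over an uncompressed flat dump, assuming each <revision> starts on its own line; a seekable compressed container such as bgzip would store block offsets instead, which is beyond this sketch.

    import re

    def build_revision_index(path):
        # Map page_id -> byte offsets of its <revision> elements, so a
        # page's history can be found without scanning the whole dump.
        rev_start_re = re.compile(rb"<revision\b")
        page_id_re = re.compile(rb"<page_id>(\d+)</page_id>")
        index = {}
        offset = 0
        current_rev_offset = None
        with open(path, "rb") as f:
            for line in f:
                if rev_start_re.search(line):
                    current_rev_offset = offset
                match = page_id_re.search(line)
                if match and current_rev_offset is not None:
                    index.setdefault(int(match.group(1)), []).append(current_rev_offset)
                offset += len(line)
        return index

index.get(page_id) then gives the offsets to seek() to, so only the relevant parts of the dump have to be read.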
Magnus
On Thu, Aug 18, 2011 at 6:30 PM, Diederik van Liere dvanliere@gmail.com wrote:
<snip>
On Thu, Aug 18, 2011 at 10:30 AM, Diederik van Liere dvanliere@gmail.com wrote:
1. Denormalization of the schema
Instead of having a <page> tag with multiple <revision> tags, I propose to have only <revision> tags. Each <revision> tag would include a <page_id>, <page_title>, <page_namespace> and <page_redirect> tag. This denormalization would make it much easier to build an incremental dump utility. You only need to keep track of the final revision of each article at the moment of dump creation, and then you can create a new incremental dump continuing from the last one. It would also be easier to restore a dump process that crashed.
Page title/namespace and redirect-ness are not fixed to a revision, and may change over time. This means that simply knowing the last revision you left off at doesn't give you enough information for a continuation point; you'd have to go back and check whether any revisions have been deleted, or whether their pages' title, redirect status, or other properties have changed.
I think it may be better to abandon the "single XML stream" data model and allow for structure and random-access. A directory tree with separate files for various pages/revisions may be a lot easier to produce & update "in-place", and could be downloaded & resynced with standard tools like rsync or a custom tool that optimizes what files it looks for.
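A rough sketch of the kind of layout that would make in-place updates and rsync-style syncing work; the bucketing scheme is just an assumption to keep individual directories from getting huge.

    import os

    def revision_path(root, page_id, rev_id):
        # One small file per revision; new revisions only add files, so an
        # rsync-style sync only has to transfer what actually changed.
        bucket = "%03d" % (page_id % 1000)  # spread pages over 1000 buckets
        return os.path.join(root, bucket, str(page_id), "%d.xml" % rev_id)

    # e.g. revision_path("enwiki", 12, 275854) -> "enwiki/012/12/275854.xml"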
There are basically a couple of different problems to solve:
1) Building a complete data set and getting that out to people
2) Updating an existing data set with new data
3) Processing a data set in some useful way
Generating the initial dump today is super expensive -- because it's a single compressed XML stream, we have to copy and re-copy most of the same data over, and over, and over.
And today there's no good way to just "apply" an incremental dump on top of your existing download.
3. Smaller dump sizes
The dump files continue to grow because the full text of each revision is stored in the XML file. Currently, the uncompressed XML dump files of the English Wikipedia are about 5.5 TB in size, and this will only continue to grow. An alternative would be to replace the <text> tag with <text_added> and <text_removed> tags. A revision's full text can still be reconstructed by applying the chain of <text_added> and <text_removed> patches, and we can provide a simple script / tool that reconstructs the full text of an article up to a particular date or revision id. This has two advantages:
- The dump files will be significantly smaller
- It will be easier and faster to analyze the types of edits: who is adding a template, who is wikifying an edit, who is fixing spelling and grammar mistakes.
Broadly speaking, some sort of diff storage makes a lot of sense, especially if it doesn't require reproducing those diffs all the time. :)
But be warned that there are different needs and different ways of processing data; diffs again interfere with random access, as you need to be able to fetch adjacent items to reproduce the text. If you're just trundling along through the entire dump and applying diffs as you go to reconstruct the text, then you're basically doing what you already do when doing on-the-fly decompression of the .xml.bz2 or .xml.7z -- it may, or may not, actually save you anything for this case.
Of course if all you really wanted was the diff, then obviously that's going to help you. :)
-- brion
On Tue, Aug 23, 2011 at 5:35 PM, Brion Vibber brion@pobox.com wrote: <snip>
Broadly speaking, some sort of diff storage makes a lot of sense, especially if it doesn't require reproducing those diffs all the time. :)
But be warned that there are different needs and different ways of processing data; diffs again interfere with random access, as you need to be able to fetch adjacent items to reproduce the text. If you're just trundling along through the entire dump and applying diffs as you go to reconstruct the text, then you're basically doing what you already do when doing on-the-fly decompression of the .xml.bz2 or .xml.7z -- it may, or may not, actually save you anything for this case.
Of course if all you really wanted was the diff, then obviously that's going to help you. :)
I've found that diff representations of the full history can knock off about 95% of the uncompressed size. When stacked with generic compressors such as bz2 and 7z, an intelligent differencing scheme can still see improvement, such that .diff.7z is about 10-50% smaller than .xml.7z while representing the same content. As you note though, the trade-off is that you have to look at many diffs to reconstruct the page's content. Given that hard disks are cheap, the biggest advantage is probably for people whose main object of study is the diffs themselves.
-Robert Rohde