On Tue, Aug 23, 2011 at 5:35 PM, Brion Vibber <brion@pobox.com> wrote: <snip>
Broadly speaking some sort of diff storage makes a lot of sense; especially if it doesn't require reproducing those diffs all the time. :)
But be warned that there are different needs and different ways of processing data; diffs again interfere with random access, as you need to be able to fetch adjacent items to reproduce the text. If you're just trundling along through the entire dump and applying diffs as you go to reconstruct the text, then you're basically doing what you already do when doing on-the-fly decompression of the .xml.bz2 or .xml.7z -- it may, or may not, actually save you anything for this case.
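To make the sequential-reconstruction point concrete, here is a minimal Python sketch of diff-based revision storage. The function names and the diff encoding (copy/insert opcodes from difflib) are illustrative, not how any dump tool actually encodes things; the point is just that reaching revision N means replaying every diff from the base text up to N.

```python
import difflib

def make_diff(old, new):
    """Record only the operations needed to turn `old` into `new`."""
    ops = []
    for tag, i1, i2, j1, j2 in difflib.SequenceMatcher(None, old, new).get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))        # reuse a slice of the old text
        else:
            ops.append(("insert", new[j1:j2]))  # store only the new material

    return ops

def apply_diff(old, ops):
    """Replay a diff against the previous revision to rebuild the text."""
    parts = []
    for op in ops:
        if op[0] == "copy":
            parts.append(old[op[1]:op[2]])
        else:
            parts.append(op[1])
    return "".join(parts)

# A page history stored as a base text plus per-revision diffs.
revisions = ["Hello world.", "Hello, world!", "Hello, diff world!"]
base = revisions[0]
diffs = [make_diff(a, b) for a, b in zip(revisions, revisions[1:])]

# There is no random access: to get the latest revision you must
# replay every intermediate diff in order, just like streaming
# decompression of the .xml.bz2 dump.
text = base
for d in diffs:
    text = apply_diff(text, d)
assert text == revisions[-1]
```

That final loop is exactly the "trundling along through the entire dump" case: if you are walking the whole history anyway, the replay cost is comparable to on-the-fly decompression.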
Of course if all you really wanted was the diff, then obviously that's going to help you. :)
I've found that diff representations of the full history can knock off about 95% of the uncompressed size. Stacked with generic compressors such as bz2 and 7z, an intelligent differencing scheme still yields an improvement: .diff.7z is about 10-50% smaller than .xml.7z while representing the same content. As you note though, the trade-off is that you have to replay many diffs to reconstruct a page's content. Given that hard disks are cheap, the biggest advantage is probably for people whose main object of study is the diffs themselves.
-Robert Rohde