Anthony wrote:
I've looked at the numbers and thought about this in detail, and I don't
think so. What definitely *would* be much more user-friendly is a
compression scheme that allows random access, so that end users don't have
to decompress everything all at once in the first place.
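
(For illustration, a minimal Python sketch of the idea -- not any actual
dump format; the layout, names, and pages-per-stream figure are all made
up. The dump is written as many independent bzip2 streams, with a side
index recording where each stream starts, so a reader can later decompress
one stream in isolation.)

import bz2

def write_indexed(pages, dump_path, index_path, pages_per_stream=100):
    # Each group of pages becomes its own bzip2 stream; independent
    # streams are what make random access possible later.
    offsets = []
    with open(dump_path, "wb") as out:
        for i in range(0, len(pages), pages_per_stream):
            offsets.append(out.tell())  # byte offset of this stream
            chunk = "".join(pages[i:i + pages_per_stream])
            out.write(bz2.compress(chunk.encode("utf-8")))
    # The index is just one stream offset per line.
    with open(index_path, "w") as idx:
        idx.write("\n".join(str(o) for o in offsets))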
The uncompressed full-history English Wikipedia dump is reaching (and has
more likely already exceeded) the size that will fit on the largest
consumer hard drives, so just dealing with such a large file is a problem
in itself. And "an enormous text file" is not very useful without an
index, so you have to import the thing into some sort of database anyway,
which, unless you're a database guru, is going to take longer than a
simple decompression.
In the long term (and considering how long it's taking just to produce a
usable dump, the long term may never come), the most user-friendly dump
would already be compressed, indexed, and ready for random access, so a
reuser could just download and go (or even download on the fly as needed).
It could be done, but I make no bet on whether or not it will be done.
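
(And a matching sketch of the reader side, under the same made-up layout;
a real "download and go" dump would also need a title-to-stream map, which
is omitted here. Seeking to a recorded offset and running a fresh
decompressor touches only the one stream that holds the wanted pages.)

import bz2

def read_stream(dump_path, index_path, n):
    # Decompress only the n-th stream; a fresh BZ2Decompressor stops
    # by itself at the end of a single bzip2 stream.
    with open(index_path) as idx:
        offsets = [int(line) for line in idx]
    out = []
    with open(dump_path, "rb") as f:
        f.seek(offsets[n])
        d = bz2.BZ2Decompressor()
        while not d.eof:
            block = f.read(64 * 1024)
            if not block:
                break
            out.append(d.decompress(block))
    return b"".join(out).decode("utf-8")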
I did make indexed, random-access, backwards-compatible XML dumps:
http://lists.wikimedia.org/pipermail/wikitech-l/2009-January/040812.html
It wouldn't be hard to plug into the dump process (just replace bzip2 with
a new DumpPipeOutput), but so far nobody has seemed interested in it.
And there's the added benefit of the offline reader I implemented using
those files.