On 8/14/07, zetawoof <zetawoof(a)gmail.com> wrote:
On 8/14/07, Anthony <wikimail(a)inbox.org> wrote:
On 8/14/07, David Gerard <dgerard(a)gmail.com> wrote:
bzip2recover. Genius. I've been wanting to do something like this
for a long time, and the one thing standing in my way was that I
couldn't figure out how to do the random access bit.
It appears to have a few flaws, though:
http://slashdot.org/comments.pl?sid=268617&cid=20222493
Basically, the approach of splitting the database into 900 kB chunks
means that you may end up splitting the XML headers between chunks,
making indexing miss a few articles.
That seems like something pretty easy to fix if I'm understanding the
problem correctly, though. Just special-case those pages when building
the index, and when looking up those titles (should be fewer than 1% of
them) decompress two chunks instead of one.
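Roughly like this (untested Python sketch; the index layout -- title
mapped to a chunk number plus a spans-boundary flag -- and the
chunk_path() helper are just my guesses at how it could be wired up,
not how the actual tool does it):

import bz2

def read_page(title, index, chunk_path):
    """Pull the raw XML for one <page> out of the split dump.

    index maps title -> (chunk_no, spans_boundary); chunk_path(n)
    returns the filename of chunk n.  Both are assumptions for the
    sake of the sketch.
    """
    chunk_no, spans_boundary = index[title]
    with bz2.open(chunk_path(chunk_no), 'rb') as f:
        data = f.read()
    if spans_boundary:
        # The special case: this <page> was cut in half when the dump
        # was split, so glue the next chunk on before searching.
        with bz2.open(chunk_path(chunk_no + 1), 'rb') as f:
            data += f.read()
    marker = b'<title>' + title.encode('utf-8') + b'</title>'
    start = data.rindex(b'<page>', 0, data.index(marker))
    end = data.index(b'</page>', start) + len(b'</page>')
    return data[start:end]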
Now what I'm interested in is whether the splitting of the file into
multiple chunks is actually necessary. It seems you should be able to
just use bzip2recover's trick on parts of the complete file. And with
an HTTP/1.1 range request you can download sections of the complete
file directly from Wikipedia's servers (or, if one gets set up, from a
bittorrent client). So you don't even have to wait until the 2 gig
download is complete; you can download while you browse.
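To sketch what I mean (rough, untested Python; the dump URL and the
byte offsets are made-up placeholders): grab an arbitrary slice of the
full .bz2 with a Range header, then do what bzip2recover does and hunt
for the 48-bit block magic 0x314159265359 at bit granularity, since
compressed blocks aren't byte-aligned:

import urllib.request

# Placeholder URL -- substitute whatever the current dump location is.
DUMP_URL = 'http://download.wikimedia.org/enwiki/latest/pages-articles.xml.bz2'

BLOCK_MAGIC = 0x314159265359  # marker at the start of every bzip2 block

def fetch_range(url, start, length):
    """Download length bytes starting at byte offset start (HTTP/1.1 Range)."""
    headers = {'Range': 'bytes=%d-%d' % (start, start + length - 1)}
    req = urllib.request.Request(url, headers=headers)
    return urllib.request.urlopen(req).read()

def block_starts(data):
    """Yield bit offsets within data where a bzip2 block appears to begin."""
    window = 0
    for i, byte in enumerate(data):
        for bit in range(8):
            window = ((window << 1) | ((byte >> (7 - bit)) & 1)) & 0xFFFFFFFFFFFF
            if window == BLOCK_MAGIC:
                yield i * 8 + bit + 1 - 48

chunk = fetch_range(DUMP_URL, 10 * 1024 * 1024, 2 * 1024 * 1024)
print(list(block_starts(chunk))[:5])

From there you'd still have to do the rest of what bzip2recover does --
prepend a "BZh9" stream header, shift the block onto a byte boundary,
and tack on an end-of-stream footer with the block's CRC -- before the
standard bzip2 tools will decompress it, but none of that needs the
whole file on disk.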