On 8/14/07, zetawoof <zetawoof(a)gmail.com> wrote:
On 8/14/07, Anthony <wikimail(a)inbox.org> wrote:
On 8/14/07, David Gerard <dgerard(a)gmail.com> wrote:
bzip2recover. Genius. I've been wanting to do something like this
for a long time, and the one thing standing in my way was that I
couldn't figure out how to do the random access bit.
It appears to have a few flaws, though:
http://slashdot.org/comments.pl?sid=268617&cid=20222493
Basically, the approach of splitting the database into 900 kB chunks
means that you may end up splitting the XML headers between chunks,
making indexing miss a few articles.
That seems like something pretty easy to fix if I'm understanding the
problem correctly, though. Just special-case those pages when building
the index, and when looking up those titles (should be fewer than 1% of
them) decompress two chunks instead of one.
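Roughly like this (untested Python sketch; the index layout -- title
mapped to a chunk number plus a spans-boundary flag -- and the
chunk_path() helper are just my guesses at how it could be wired up,
not how the actual tool does it):

import bz2

def read_page(title, index, chunk_path):
    """Pull the raw XML for one <page> out of the split dump.

    index maps title -> (chunk_no, spans_boundary); chunk_path(n)
    returns the filename of chunk n.  Both are assumptions for the
    sake of the sketch.
    """
    chunk_no, spans_boundary = index[title]
    with bz2.open(chunk_path(chunk_no), 'rb') as f:
        data = f.read()
    if spans_boundary:
        # The special case: this <page> was cut in half when the dump
        # was split, so glue the next chunk on before searching.
        with bz2.open(chunk_path(chunk_no + 1), 'rb') as f:
            data += f.read()
    marker = b'<title>' + title.encode('utf-8') + b'</title>'
    start = data.rindex(b'<page>', 0, data.index(marker))
    end = data.index(b'</page>', start) + len(b'</page>')
    return data[start:end]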
Now what I'm interested in is whether the splitting of the file into
multiple chunks is actually necessary. It seems you should be able to
just use bzip2recover's trick on parts of the complete file. And with
an HTTP/1.1 range request you can download sections of the complete
file directly from Wikipedia's servers (or, if one gets set up, from a
bittorrent client). So you don't even have to wait until the 2 gig
download is complete; you can download while you browse.
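To sketch what I mean (rough, untested Python; the dump URL and the
byte offsets are made-up placeholders): grab an arbitrary slice of the
full .bz2 with a Range header, then do what bzip2recover does and hunt
for the 48-bit block magic 0x314159265359 at bit granularity, since
compressed blocks aren't byte-aligned:

import urllib.request

# Placeholder URL -- substitute whatever the current dump location is.
DUMP_URL = 'http://download.wikimedia.org/enwiki/latest/pages-articles.xml.bz2'

BLOCK_MAGIC = 0x314159265359  # marker at the start of every bzip2 block

def fetch_range(url, start, length):
    """Download length bytes starting at byte offset start (HTTP/1.1 Range)."""
    headers = {'Range': 'bytes=%d-%d' % (start, start + length - 1)}
    req = urllib.request.Request(url, headers=headers)
    return urllib.request.urlopen(req).read()

def block_starts(data):
    """Yield bit offsets within data where a bzip2 block appears to begin."""
    window = 0
    for i, byte in enumerate(data):
        for bit in range(8):
            window = ((window << 1) | ((byte >> (7 - bit)) & 1)) & 0xFFFFFFFFFFFF
            if window == BLOCK_MAGIC:
                yield i * 8 + bit + 1 - 48

chunk = fetch_range(DUMP_URL, 10 * 1024 * 1024, 2 * 1024 * 1024)
print(list(block_starts(chunk))[:5])

From there you'd still have to do the rest of what bzip2recover does --
prepend a "BZh9" stream header, shift the block onto a byte boundary,
and tack on an end-of-stream footer with the block's CRC -- before the
standard bzip2 tools will decompress it, but none of that needs the
whole file on disk.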