On 8/14/07, zetawoof <zetawoof@gmail.com> wrote:
On 8/14/07, Anthony <wikimail@inbox.org> wrote:
On 8/14/07, David Gerard <dgerard@gmail.com> wrote:
bzip2recover. Genius. I've been wanting to do something like this for a long time, and the one thing standing in my way was that I couldn't figure out how to do the random access bit.
It appears to have a few flaws, though:
http://slashdot.org/comments.pl?sid=268617&cid=20222493
Basically, the approach of splitting the database into 900 kB chunks means that you may end up splitting the XML headers between chunks, making indexing miss a few articles.
That seems pretty easy to fix if I'm understanding the problem correctly, though. Just special-case it when building the index, and when looking up those titles (which should be less than 1% of them) you uncompress two files instead of one.
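Something like this, say (just a rough Python sketch of the idea; the chunk file names, the 2 kB seam window and the decode step are my own assumptions, not anything the reader actually does on disk):

import bz2
import glob
import re

TITLE_RE = re.compile(r"<title>(.*?)</title>")

def read_chunk(path):
    with open(path, "rb") as f:
        return bz2.decompress(f.read()).decode("utf-8", "replace")

def build_index(chunk_paths):
    """Map each title to (chunk number, True if it straddles a boundary)."""
    index = {}
    prev_tail = ""
    for n, path in enumerate(chunk_paths):
        text = read_chunk(path)
        for title in TITLE_RE.findall(text):
            index.setdefault(title, (n, False))
        # Special case at build time: a <title> split across the boundary
        # with the previous chunk is invisible above, so scan the seam too.
        for title in TITLE_RE.findall(prev_tail + text[:2048]):
            index.setdefault(title, (n - 1, True))
        prev_tail = text[-2048:]
    return index

def fetch_article(title, index, chunk_paths):
    """Return the <page>...</page> element, uncompressing two chunks if needed."""
    n, split = index[title]
    text = read_chunk(chunk_paths[n])
    marker = "<title>%s</title>" % title
    if (split or text.find("</page>", text.find(marker)) < 0) and n + 1 < len(chunk_paths):
        text += read_chunk(chunk_paths[n + 1])   # two files instead of one
    start = text.rfind("<page>", 0, text.index(marker))
    end = text.index("</page>", start) + len("</page>")
    return text[start:end]

if __name__ == "__main__":
    chunks = sorted(glob.glob("chunk_*.bz2"))
    idx = build_index(chunks)
    print(fetch_article("Bzip2", idx, chunks))

The only titles that end up flagged are the ones the plain per-chunk scan missed, so the two-chunk decompression really does stay a rare special case.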
Now what I'm interested in is whether the splitting of the file into multiple chunks is actually necessary. It seems you should be able to just run bzip2recover on parts of the complete file. And with an HTTP/1.1 Range request you can download sections of the complete file directly from Wikipedia's servers (or, if one gets set up, from a BitTorrent client). So you don't even have to wait until the 2 gig download is complete; you can download while you browse.
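To sketch that (Python again; the URL, byte offsets and file names are placeholders, not real ones): grab an arbitrary byte range of the full .bz2 with a Range request, then let bzip2recover fish the complete blocks out of the slice.

import subprocess
import urllib.request

DUMP_URL = "http://example.org/enwiki-pages-articles.xml.bz2"  # placeholder

def fetch_range(url, start, end, out_path):
    """Download bytes start..end (inclusive) via an HTTP/1.1 Range request."""
    req = urllib.request.Request(url, headers={"Range": "bytes=%d-%d" % (start, end)})
    with urllib.request.urlopen(req) as resp, open(out_path, "wb") as out:
        # A compliant server answers 206 Partial Content; a plain 200 means
        # it ignored the Range header and is sending the whole file.
        if resp.status != 206:
            raise RuntimeError("server did not honor the Range request")
        out.write(resp.read())

# Grab a ~5 MB slice somewhere in the middle of the dump...
fetch_range(DUMP_URL, 100 * 2**20, 105 * 2**20 - 1, "slice.bz2")

# ...and let bzip2recover pull out whatever complete bzip2 blocks the slice
# happens to contain (it writes them out as rec00001slice.bz2, ...).
subprocess.run(["bzip2recover", "slice.bz2"], check=True)

Each rec*.bz2 it writes is a self-contained bzip2 stream you can bunzip2 on its own, which is what would make the random access work without pre-splitting the dump.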