On 8/14/07, David Gerard dgerard@gmail.com wrote:
bzip2recover. Genius. I've been wanting to do something like this for a long time, and the one thing standing in my way was that I couldn't figure out how to do the random access bit.
For those too lazy to read, here's the stroke of genius (IMHO): "We would certainly prefer not to use MySQL or any other database, since we are only *reading* Wikipedia, not writing into it. [....] we can use the bzip2recover tool (part of bzip2 distribution) to "recover" the individual parts of this compressed file: Basically, BZIP splits its input into 900K (by default) size blocks, [....] What this means, in plain English, is that we can convert the huge downloaded .bz2 file to a large set of small (smaller than 1MB) files, each one individually decompressible!"
Now, what's the applicable command to do this with the .7zip file?
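To make the quoted bzip2 trick concrete, here is a rough Python sketch of building a title index over the recovered chunks. It assumes the dump has already been split with "bzip2recover pages-articles.xml.bz2"; the rec00001... naming follows the bzip2 man page and may differ slightly between versions.

    import bz2, glob, re

    index = {}  # article title -> chunk file whose block contains the <title> tag
    for chunk in sorted(glob.glob("rec*pages-articles.xml.bz2")):
        # Each recovered chunk is a complete bzip2 stream, so it decompresses
        # on its own; errors="replace" papers over multi-byte characters that
        # were cut in half at a chunk boundary.
        text = bz2.decompress(open(chunk, "rb").read()).decode("utf-8", "replace")
        for title in re.findall(r"<title>(.*?)</title>", text):
            index.setdefault(title, chunk)
    print(len(index), "titles indexed")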
On 8/14/07, Anthony wikimail@inbox.org wrote:
On 8/14/07, David Gerard dgerard@gmail.com wrote:
bzip2recover. Genius. I've been wanting to do something like this for a long time, and the one thing standing in my way was that I couldn't figure out how to do the random access bit.
It appears to have a few flaws, though:
http://slashdot.org/comments.pl?sid=268617&cid=20222493
Basically, the approach of splitting the dump into 900 kB chunks means that a page's XML markup (including its <title> header) can end up split between two chunks, so the indexing misses a few articles.
On 8/14/07, zetawoof zetawoof@gmail.com wrote:
On 8/14/07, Anthony wikimail@inbox.org wrote:
On 8/14/07, David Gerard dgerard@gmail.com wrote:
bzip2recover. Genius. I've been wanting to do something like this for a long time, and the one thing standing in my way was that I couldn't figure out how to do the random access bit.
It appears to have a few flaws, though:
http://slashdot.org/comments.pl?sid=268617&cid=20222493
Basically, the approach of splitting the dump into 900 kB chunks means that a page's XML markup (including its <title> header) can end up split between two chunks, so the indexing misses a few articles.
That seems like something pretty easy to fix, though, if I'm understanding the problem correctly. Just special-case it when building the index, and when looking up those titles (which should be less than 1% of them) decompress two files instead of one.
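A minimal sketch of that special case, assuming an index that maps each title to the number of the recovered chunk its <title> tag appears in (the chunk naming and the helper below are only illustrative):

    import bz2

    def read_chunk(n):
        # Illustrative helper: decompress recovered chunk number n
        # (exact file naming depends on the bzip2recover version).
        name = "rec%05dpages-articles.xml.bz2" % n
        return bz2.decompress(open(name, "rb").read()).decode("utf-8", "replace")

    def get_page(title, chunk_no):
        text = read_chunk(chunk_no)
        start = text.rfind("<page>", 0, text.find("<title>%s</title>" % title))
        end = text.find("</page>", start)
        if end == -1:
            # The page runs past the end of this chunk, so decompress the
            # next one too -- the "two files instead of one" case.
            # (A page spanning more than two chunks, or a <page> tag cut off
            # at the very start of a chunk, isn't handled in this sketch.)
            text = text[start:] + read_chunk(chunk_no + 1)
            start, end = 0, text.find("</page>")
        return text[start:end + len("</page>")]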
Now what I'm interested in is whether or not the splitting of the file into multiple chunks is actually necessary. It seems you should be able to just use bzip2recover on parts of the complete file. And with an HTTP/1.1 range request you can download sections of the complete file directly from Wikipedia's servers (or, if one gets set up, from a BitTorrent client). So you don't even have to wait until the 2 GB download is complete; you can download while you browse.
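For the range-request part, something like this is all it takes on the client side (a Python sketch; the URL is a placeholder, not an actual download location):

    import urllib.request

    url = "http://example.org/enwiki-pages-articles.xml.bz2"  # placeholder URL
    req = urllib.request.Request(url, headers={"Range": "bytes=1000000-1999999"})
    with urllib.request.urlopen(req) as resp:
        data = resp.read()  # roughly a 1 MB slice from the middle of the .bz2
    # To decompress anything from that slice you still have to find a bzip2
    # block boundary inside it (the 48-bit block magic 0x314159265359, which
    # is not byte-aligned) -- which is what bzip2recover's source shows how to do.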
On 8/14/07, Anthony wikimail@inbox.org wrote:
Now what I'm interested in is whether or not the splitting of the file into multiple chunks is actually necessary. It seems you should be able to just use bzip2recover on parts of the complete file.
Wow, it gets even better. Not only is this not necessary, but the source code to bzip2recover is tiny. See http://swtch.com/usr/local/plan9/src/cmd/bzip2/bzip2recover.c
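For anyone curious, the heart of it transcribes into a few lines of Python: scan the compressed stream bit by bit for the 48-bit block-start magic and note where each block begins. This is only a sketch of the scanning step; the real bzip2recover.c also looks for the end-of-stream magic and writes out proper stream headers and CRCs when it emits each block.

    BLOCK_MAGIC = 0x314159265359  # bzip2's 48-bit block-start signature

    def find_block_offsets(path):
        offsets = []          # bit offsets where compressed blocks begin
        buf, bitpos = 0, 0
        with open(path, "rb") as f:
            while True:
                byte = f.read(1)
                if not byte:
                    break
                for i in range(7, -1, -1):
                    # Blocks are not byte-aligned, hence the bit-level search.
                    buf = ((buf << 1) | ((byte[0] >> i) & 1)) & ((1 << 48) - 1)
                    bitpos += 1
                    if buf == BLOCK_MAGIC:
                        offsets.append(bitpos - 48)
        return offsets

Pure Python will be slow on a 2 GB file, but it shows why random access works at all: each block can be cut out at its bit offset and decompressed independently.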
If anyone is interested in discussing more about this, I'm going to bring it up on wiki-research-l. This is probably the wrong mailing list.
Anthony
David Gerard wrote:
It would have been simpler if he used the static HTML dump instead of the XML. It's not hard to make a desktop reader out of it. I wrote a proof of concept a while back.
-- Tim Starling
On 8/15/07, Tim Starling tstarling@wikimedia.org wrote:
David Gerard wrote:
It would have been simpler if he used the static HTML dump instead of the XML. It's not hard to make a desktop reader out of it. I wrote a proof of concept a while back.
Unless I'm reading your idea incorrectly, the static HTML dump seems to be about 4 times as large and in 7zip format instead of bzip2.
Got a link to your proof of concept?
Anthony wrote:
On 8/15/07, Tim Starling tstarling@wikimedia.org wrote:
David Gerard wrote:
It would have been simpler if he used the static HTML dump instead of the XML. It's not hard to make a desktop reader out of it. I wrote a proof of concept a while back.
Unless I'm reading your idea incorrectly, the static HTML dump seems to be about 4 times as large and in 7zip format instead of bzip2.
It's 1.5 times larger than pages-meta-current.xml.bz2, which is the equivalent XML dump, or 2.7 times larger than pages-articles.xml.bz2.
Got a link to your proof of concept?
http://noc.wikimedia.org/~tstarling/static-dump-reader.php.html
-- Tim Starling
On 8/17/07, Tim Starling tstarling@wikimedia.org wrote:
Anthony wrote:
On 8/15/07, Tim Starling tstarling@wikimedia.org wrote:
It would have been simpler if he used the static HTML dump instead of the XML. It's not hard to make a desktop reader out of it. I wrote a proof of concept a while back.
Unless I'm reading your idea incorrectly, the static HTML dump seems to be about 4 times as large and in 7zip format instead of bzip2.
It's 1.5 times larger than pages-meta-current.xml.bz2, which is the equivalent XML dump, or 2.7 times larger than pages-articles.xml.bz2.
Well, most significantly, it won't fit on a single-layer DVD-R(W). And it's from April, rather than August.
pages-meta-current doesn't fit on a single-layer DVD either, though.
Got a link to your proof of concept?
http://noc.wikimedia.org/~tstarling/static-dump-reader.php.html
I haven't had time to try it out yet, but it looks pretty nice.
On 8/15/07, Tim Starling tstarling@wikimedia.org wrote:
It would have been simpler if he used the static HTML dump instead of the XML. It's not hard to make a desktop reader out of it. I wrote a proof of concept a while back.
There are a number of those around now. I'm surveying them here: http://intelligentdesigns.net/blog/?p=73
On 8/14/07, David Gerard dgerard@gmail.com wrote:
http://it.slashdot.org/article.pl?sid=07/08/13/1939231
- d.
Is there going to be a way to do this for us programming n00bies? ~~~~
On 8/14/07, David Gerard dgerard@gmail.com wrote:
FWIW, I mailed him privately and suggested using a zero-install PHP server running MediaWiki, accessing the bzip2 dump through a modified Database.php file. I did that once with an SQLite database, but I ran into size and access-speed trouble when trying a "real-size" Wikipedia instead of a small demo. He seems to agree, but doesn't know enough PHP to hack it.
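For reference, the SQLite side of that experiment is easy to reproduce along these lines (a sketch only; the table name and schema here are illustrative, not the real MediaWiki schema):

    import sqlite3

    con = sqlite3.connect("wikipedia.sqlite")
    con.execute("CREATE TABLE IF NOT EXISTS page (title TEXT PRIMARY KEY, text TEXT)")
    con.execute("INSERT OR REPLACE INTO page VALUES (?, ?)", ("Example", "...wikitext..."))
    con.commit()
    row = con.execute("SELECT text FROM page WHERE title = ?", ("Example",)).fetchone()
    # Fine for a small demo; with the full dump loaded, the single table runs
    # into the multi-gigabyte range, which is where the size/speed trouble starts.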
Magnus