On 8/14/07, David Gerard dgerard@gmail.com wrote:
bzip2recover. Genius. I've been wanting to do something like this for a long time, and the one thing standing in my way was that I couldn't figure out how to do the random access bit.
For those too lazy to read, here's the stroke of genius (IMHO): "We would certainly prefer not to use MySQL or any other database, since we are only *reading* Wikipedia, not writing into it. [....] we can use the bzip2recover tool (part of bzip2 distribution) to "recover" the individual parts of this compressed file: Basically, BZIP splits its input into 900K (by default) size blocks, [....] What this means, in plain English, is that we can convert the huge downloaded .bz2 file to a large set of small (smaller than 1MB) files, each one individually decompressible!"
Now, what's the applicable command to do this with the .7zip file?
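To make the quoted bzip2 trick concrete, here is a rough Python sketch of building a title index over the recovered chunks. It assumes the dump has already been split with "bzip2recover pages-articles.xml.bz2"; the rec00001... naming follows the bzip2 man page and may differ slightly between versions.

    import bz2, glob, re

    index = {}  # article title -> chunk file whose block contains the <title> tag
    for chunk in sorted(glob.glob("rec*pages-articles.xml.bz2")):
        # Each recovered chunk is a complete bzip2 stream, so it decompresses
        # on its own; errors="replace" papers over multi-byte characters that
        # were cut in half at a chunk boundary.
        text = bz2.decompress(open(chunk, "rb").read()).decode("utf-8", "replace")
        for title in re.findall(r"<title>(.*?)</title>", text):
            index.setdefault(title, chunk)
    print(len(index), "titles indexed")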
On 8/14/07, Anthony wikimail@inbox.org wrote:
On 8/14/07, David Gerard dgerard@gmail.com wrote:
bzip2recover. Genius. I've been wanting to do something like this for a long time, and the one thing standing in my way was that I couldn't figure out how to do the random access bit.
It appears to have a few flaws, though:
http://slashdot.org/comments.pl?sid=268617&cid=20222493
Basically, the approach of splitting the dump into 900 kB chunks means that a page's XML markup (including its <title> header) can end up split between two chunks, so the indexing misses a few articles.
On 8/14/07, zetawoof zetawoof@gmail.com wrote:
On 8/14/07, Anthony wikimail@inbox.org wrote:
On 8/14/07, David Gerard dgerard@gmail.com wrote:
bzip2recover. Genius. I've been wanting to do something like this for a long time, and the one thing standing in my way was that I couldn't figure out how to do the random access bit.
It appears to have a few flaws, though:
http://slashdot.org/comments.pl?sid=268617&cid=20222493
Basically, the approach of splitting the dump into 900 kB chunks means that a page's XML markup (including its <title> header) can end up split between two chunks, so the indexing misses a few articles.
That seems like something pretty easy to fix, though, if I'm understanding the problem correctly. Just special-case it when building the index, and when looking up those titles (which should be less than 1% of them) decompress two files instead of one.
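A minimal sketch of that special case, assuming an index that maps each title to the number of the recovered chunk its <title> tag appears in (the chunk naming and the helper below are only illustrative):

    import bz2

    def read_chunk(n):
        # Illustrative helper: decompress recovered chunk number n
        # (exact file naming depends on the bzip2recover version).
        name = "rec%05dpages-articles.xml.bz2" % n
        return bz2.decompress(open(name, "rb").read()).decode("utf-8", "replace")

    def get_page(title, chunk_no):
        text = read_chunk(chunk_no)
        start = text.rfind("<page>", 0, text.find("<title>%s</title>" % title))
        end = text.find("</page>", start)
        if end == -1:
            # The page runs past the end of this chunk, so decompress the
            # next one too -- the "two files instead of one" case.
            # (A page spanning more than two chunks, or a <page> tag cut off
            # at the very start of a chunk, isn't handled in this sketch.)
            text = text[start:] + read_chunk(chunk_no + 1)
            start, end = 0, text.find("</page>")
        return text[start:end + len("</page>")]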
Now what I'm interested in is whether or not the splitting of the file into multiple chunks is actually necessary. It seems you should be able to just use bzip2recover on parts of the complete file. And with an HTTP/1.1 range request you can download sections of the complete file directly from Wikipedia's servers (or, if one gets set up, from a BitTorrent client). So you don't even have to wait until the 2 GB download is complete; you can download while you browse.
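For the range-request part, something like this is all it takes on the client side (a Python sketch; the URL is a placeholder, not an actual download location):

    import urllib.request

    url = "http://example.org/enwiki-pages-articles.xml.bz2"  # placeholder URL
    req = urllib.request.Request(url, headers={"Range": "bytes=1000000-1999999"})
    with urllib.request.urlopen(req) as resp:
        data = resp.read()  # roughly a 1 MB slice from the middle of the .bz2
    # To decompress anything from that slice you still have to find a bzip2
    # block boundary inside it (the 48-bit block magic 0x314159265359, which
    # is not byte-aligned) -- which is what bzip2recover's source shows how to do.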
On 8/14/07, Anthony wikimail@inbox.org wrote:
Now what I'm interested in is whether or not the splitting of the file into multiple chunks is actually necessary. It seems you should be able to just use bzip2recover on parts of the complete file.
Wow, it gets even better. Not only is this not necessary, but the source code to bzip2recover is tiny. See http://swtch.com/usr/local/plan9/src/cmd/bzip2/bzip2recover.c
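For anyone curious, the heart of it transcribes into a few lines of Python: scan the compressed stream bit by bit for the 48-bit block-start magic and note where each block begins. This is only a sketch of the scanning step; the real bzip2recover.c also looks for the end-of-stream magic and writes out proper stream headers and CRCs when it emits each block.

    BLOCK_MAGIC = 0x314159265359  # bzip2's 48-bit block-start signature

    def find_block_offsets(path):
        offsets = []          # bit offsets where compressed blocks begin
        buf, bitpos = 0, 0
        with open(path, "rb") as f:
            while True:
                byte = f.read(1)
                if not byte:
                    break
                for i in range(7, -1, -1):
                    # Blocks are not byte-aligned, hence the bit-level search.
                    buf = ((buf << 1) | ((byte[0] >> i) & 1)) & ((1 << 48) - 1)
                    bitpos += 1
                    if buf == BLOCK_MAGIC:
                        offsets.append(bitpos - 48)
        return offsets

Pure Python will be slow on a 2 GB file, but it shows why random access works at all: each block can be cut out at its bit offset and decompressed independently.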
If anyone is interested in discussing more about this, I'm going to bring it up on wiki-research-l. This is probably the wrong mailing list.
Anthony
David Gerard wrote:
It would have been simpler if he used the static HTML dump instead of the XML. It's not hard to make a desktop reader out of it. I wrote a proof of concept a while back.
-- Tim Starling
On 8/15/07, Tim Starling tstarling@wikimedia.org wrote:
David Gerard wrote:
It would have been simpler if he used the static HTML dump instead of the XML. It's not hard to make a desktop reader out of it. I wrote a proof of concept a while back.
Unless I'm reading your idea incorrectly, the static HTML dump seems to be about 4 times as large and in 7zip format instead of bzip2.
Got a link to your proof of concept?
Anthony wrote:
On 8/15/07, Tim Starling tstarling@wikimedia.org wrote:
David Gerard wrote:
It would have been simpler if he used the static HTML dump instead of the XML. It's not hard to make a desktop reader out of it. I wrote a proof of concept a while back.
Unless I'm reading your idea incorrectly, the static HTML dump seems to be about 4 times as large and in 7zip format instead of bzip2.
It's 1.5 times larger than pages-meta-current.xml.bz2, which is the equivalent XML dump, or 2.7 times larger than pages-articles.xml.bz2.
Got a link to your proof of concept?
http://noc.wikimedia.org/~tstarling/static-dump-reader.php.html
-- Tim Starling
On 8/17/07, Tim Starling tstarling@wikimedia.org wrote:
Anthony wrote:
On 8/15/07, Tim Starling tstarling@wikimedia.org wrote:
It would have been simpler if he used the static HTML dump instead of the XML. It's not hard to make a desktop reader out of it. I wrote a proof of concept a while back.
Unless I'm reading your idea incorrectly, the static HTML dump seems to be about 4 times as large and in 7zip format instead of bzip2.
It's 1.5 times larger than pages-meta-current.xml.bz2, which is the equivalent XML dump, or 2.7 times larger than pages-articles.xml.bz2.
Well, most significantly, it won't fit on a single-layer DVD-R(W). And it's from April, rather than August.
pages-meta-current doesn't fit on a single-layer DVD either, though.
Got a link to your proof of concept?
http://noc.wikimedia.org/~tstarling/static-dump-reader.php.html
I haven't had time to try it out yet, but it looks pretty nice.
On 8/15/07, Tim Starling tstarling@wikimedia.org wrote:
It would have been simpler if he used the static HTML dump instead of the XML. It's not hard to make a desktop reader out of it. I wrote a proof of concept a while back.
There are a number of those around now. I'm surveying them here: http://intelligentdesigns.net/blog/?p=73
On 8/14/07, David Gerard dgerard@gmail.com wrote:
http://it.slashdot.org/article.pl?sid=07/08/13/1939231
- d.
Is there going to be a way to do this for us programming n00bies? ~~~~
On 8/14/07, David Gerard dgerard@gmail.com wrote:
FWIW, I mailed him privately and suggested using a zero-install PHP server running MediaWiki, accessing the bzip2 dump through a modified Database.php file. I did that once with an SQLite database, but I ran into size and access-speed trouble when trying a "real-size" Wikipedia instead of a small demo. He seems to agree, but doesn't know enough PHP to hack it.
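For reference, the SQLite side of that experiment is easy to reproduce along these lines (a sketch only; the table name and schema here are illustrative, not the real MediaWiki schema):

    import sqlite3

    con = sqlite3.connect("wikipedia.sqlite")
    con.execute("CREATE TABLE IF NOT EXISTS page (title TEXT PRIMARY KEY, text TEXT)")
    con.execute("INSERT OR REPLACE INTO page VALUES (?, ?)", ("Example", "...wikitext..."))
    con.commit()
    row = con.execute("SELECT text FROM page WHERE title = ?", ("Example",)).fetchone()
    # Fine for a small demo; with the full dump loaded, the single table runs
    # into the multi-gigabyte range, which is where the size/speed trouble starts.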
Magnus