Dear all,
I thought I'd introduce myself first. I've been using Wikipedia for about four years, mostly the English one (user houshuang), but also occasionally the Norwegian (I am Norwegian), the Chinese, and lately the Indonesian (I currently live in Jakarta). I am also very interested in Wikipedia from a social and technological perspective.
Lately I've been working a lot on a way to use Wikipedia HTML dump files offline. I posted about this a while ago; the current version can be downloaded from http://houshuang.org/blog . I am working on a new version with a few improvements. It's working quite well right now (I have about 8 small language wikis on my HD, and all the interlanguage links work, etc.). The idea is that it works without unzipping the files first: you just place them in the right directory, and given that you have 7zip and Ruby, it should just work (serving pages to localhost). I will make a better installer, a tiny graphical UI, etc. later, so that I can put it on a CD with a given language file, and it will just work with one click on Mac, Windows, and Linux.
The problem is that 7zip was never optimized for quickly extracting one given file out of hundreds of thousands, or millions. Right now, the Indonesian Wikipedia (60 MB 7zipped) takes about 15 seconds per page on my two-year-old iBook, whereas the Chinese one (250 MB 7zipped) takes about 150 seconds per page. I haven't dared try any of the bigger ones, like the German (1.5 GB) or the English (four files of 1.5 GB each)... My first thought was whether it would be possible to modify the open-source 7zip to generate an index of which block the different files are in, which would then make the actual extraction a lot faster. The problem is that I suck at C, and I have been looking for people to help me, even offering a small bounty to the developer. (If anyone here would help me, that would be MUCH appreciated! I personally think it would be quite easy, given the source code that exists, but I don't know for sure.)
The developer himself suggested packing the Wikipedia dump file with something like this:

7z a -mx -ms=32M -m0d=32M archive *
which would make it more modular and much faster to access. However, I really don't want to repack all the dump files (I cannot imagine how long it would take to rezip the 1.5 GB file), and I don't have the capacity to host them - my intention has always been for my program to work out of the box with the Wikipedia dump files... So I am writing here, since I don't know how else to contact the dump file developers: is there any way they would consider using these options when making the dump files, and if not, what are the reasons? (Maybe the files would get slightly bigger, but I think the benefit would far outweigh the disadvantage!)
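For what it's worth, the idea behind smaller solid blocks plus an index can be sketched in a few lines of Ruby. This is only a toy model (using Zlib-compressed blocks as a stand-in for real 7z solid blocks, with made-up names): pages are compressed in small independent groups, and a lookup only decompresses the one group containing the requested page.

```ruby
require 'zlib'

# Toy model of solid-block compression: pages are grouped into
# independently compressed blocks, and an index records which
# block each page lives in. Looking up a page then only costs
# decompressing one block, not the whole archive.
BLOCK_SIZE = 3 # pages per block (tiny, for illustration)

def pack(pages)
  blocks = []
  index  = {} # page title => [block number, position in block]
  pages.each_slice(BLOCK_SIZE).with_index do |slice, b|
    slice.each_with_index { |(title, _), i| index[title] = [b, i] }
    blocks << Zlib::Deflate.deflate(Marshal.dump(slice))
  end
  [blocks, index]
end

def lookup(blocks, index, title)
  b, i = index[title]
  return nil unless b
  slice = Marshal.load(Zlib::Inflate.inflate(blocks[b]))
  slice[i][1] # the page body
end

pages = (1..10).map { |n| ["Page#{n}", "body #{n}"] }
blocks, index = pack(pages)
```

With ten pages and three pages per block, `lookup(blocks, index, "Page7")` only inflates the third block; the other three stay compressed.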
Anyway, any help, guidance, or ideas would be much appreciated. And feel free to play around with my program. It's very unfinished (and I have a better version which I will publish soon), but it's already quite functional, and has gotten me through several long, boring meetings in fancy hotels with overpriced wifi :)
Thank you very much, and please let me know if there are other mailing lists or Wikipedia discussion pages which would be more appropriate for this question.
Stian in Jakarta
Hi
How about using another algorithm that does this already?
-- chris
[...] My first thought was if it was possible to modify the open-source 7zip to generate an index of which block the different files are in, which would then make the actual extraction a lot faster. [...]
If I'm not mistaken, the 7z format has no index at the start of the file with pointers to where all the files start. Just use plain old zip files: they are larger, but they are great for picking out just one file.
Cheers, Peter.
On 2/27/07, Stian Haklev shaklev@gmail.com wrote: [...]
Hi Stian,
I was playing with your tool yesterday. In my experiments I found the access time for a page to be almost constant: it is dominated by the cost of uncompressing the index.
Since you launch the program every time, it needs to uncompress, store, and read the index over and over again.
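One quick mitigation (a sketch, with hypothetical file names and a stubbed-out index builder) would be to cache the parsed index on disk after the first run, so subsequent launches can load it directly instead of uncompressing it again:

```ruby
require 'tmpdir'

# Load a previously cached index if one exists; otherwise build it
# (stubbed out below) and cache it with Marshal for the next launch.
def load_index(cache_path)
  if File.exist?(cache_path)
    Marshal.load(File.binread(cache_path)) # fast path: skip decompression
  else
    index = build_index                    # slow path: scan the archive
    File.binwrite(cache_path, Marshal.dump(index))
    index
  end
end

# Stand-in for the expensive step of uncompressing and parsing
# the archive's file list.
def build_index
  { "Jakarta" => 0, "Oslo" => 1 }
end
```

The first call pays the full cost; every later launch just reads the Marshal dump.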
I modified 7z_C a bit to accept the same parameters as 7z, and found that it takes 4.2 seconds against 7-8.2 seconds for 7z, or 6.2 versus 7.2...
That is a big difference, but still too slow. It may have to do with the program being less general: more specific code means a smaller binary, and since the program has to be loaded by the OS so often, the smaller, the better.
My original idea was to modify it so it could mmap the file index. But taking into account that the index has lots of pointers, I'm now considering making it a full server (dropping the Ruby part) so it can keep the 7z data in memory.
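The full-server idea could look something like this minimal sketch (stdlib TCPServer, with an invented protocol: one page title per line in, one answer per line out). The point is simply that the index lives in the process and is built only once:

```ruby
require 'socket'

# Minimal persistent lookup server: the index is built once at
# startup and stays in memory, so each request is just a hash
# lookup instead of a fresh decompression of the archive index.
def start_server(index, port = 0)
  server = TCPServer.new('127.0.0.1', port)
  thread = Thread.new do
    loop do
      client = server.accept
      title  = client.gets.to_s.chomp
      client.puts(index.fetch(title, 'NOT FOUND'))
      client.close
    end
  end
  [server, thread]
end

def query(port, title)
  sock = TCPSocket.new('127.0.0.1', title ? port : port)
  sock.puts(title)
  reply = sock.gets.chomp
  sock.close
  reply
end
```

A real version would of course answer with page contents rather than index entries, but the startup-cost argument is the same.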
Opinions?
Stian Haklev wrote:
The developer himself suggested packing the Wikipedia dump file with something like this 7z a -mx -ms=32M -m0d=32M archive *
The static HTML dumps are already packed with -ms8m, a chunk size which I found during testing to give a good tradeoff between compression ratio and random access speed. But perhaps my testing was biased. You're not the first person to have complained about it. Reducing the chunk size to say 2-4MB might be a good move. But increasing it to 32MB would be a step in the wrong direction.
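The tradeoff described here is easy to demonstrate on toy data (again using Zlib blocks as a stand-in for 7z solid chunks, with invented sample pages): larger chunks compress better because redundancy across pages is exploited, but a random lookup then has to decompress more data.

```ruby
require 'zlib'

# Compress the same pages with different "chunk" sizes and compare
# total compressed size. Larger chunks let the compressor exploit
# redundancy between pages, at the cost of slower random access.
def packed_size(pages, pages_per_chunk)
  pages.each_slice(pages_per_chunk).sum do |chunk|
    Zlib::Deflate.deflate(chunk.join).bytesize
  end
end

pages = Array.new(100) { |n| "<html><body>Article #{n}, mostly boilerplate markup</body></html>" }
small = packed_size(pages, 1)   # one page per chunk: fast access, poor ratio
large = packed_size(pages, 100) # one big solid chunk: best ratio, slow access
```

Intermediate chunk sizes land between the two extremes, which is the tradeoff the 8 MB chunk size is balancing.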
I'm not sure what -mx is meant to do; the manual implies that option should be followed by a number. Presumably -m0d=32M is meant to set the dictionary size to 32MB. I'm not sure what good that would do when the chunk size is far below the default dictionary size. Less memory usage, perhaps?
-- Tim Starling