Dear all,
I thought I'd introduce myself first. I've been using Wikipedia for about four years, mostly the English one (user houshuang), but also occasionally the Norwegian (I am Norwegian), the Chinese, and lately the Indonesian (I currently live in Jakarta). I am also very interested in Wikipedia from a social and technological perspective.
Lately I've been working a lot on a way to use the Wikipedia HTML dump files offline. I posted about this a while ago; the current version can be downloaded from http://houshuang.org/blog , and I am working on a new version with a few improvements. It's working quite well right now (I have about 8 small-language wikis on my hard drive, and all the interlanguage links work, etc.). The idea is that it works without unzipping the files first: you just place them in the right directory, and provided you have 7zip and Ruby installed, it should just work, serving the pages to localhost. Later I will make a proper installer, a tiny graphical UI, and so on, so that I can put it on a CD together with a given language's dump file, and it will work with one click on Mac, Windows, and Linux.
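To give a rough idea of how it works, here is a much simplified sketch of the serving part (this is not my actual script, which also handles caching and the interlanguage links; the archive name, port, and path layout below are just examples):

    # Very reduced sketch: serve pages straight out of a 7z dump on localhost.
    # Assumes 7z is on the PATH; archive name, port, and paths are examples.
    require 'webrick'
    require 'open3'

    ARCHIVE = 'id.7z'   # example: the Indonesian HTML dump, left 7zipped

    server = WEBrick::HTTPServer.new(Port: 8000)

    server.mount_proc('/') do |req, res|
      # Treat the request path as the file name inside the archive.
      page = req.path.sub(%r{\A/}, '')
      # Ask 7z to decompress just that one file to stdout (-so).
      html, status = Open3.capture2('7z', 'e', '-so', ARCHIVE, page)
      if status.success? && !html.empty?
        res.content_type = 'text/html'
        res.body = html
      else
        res.status = 404
        res.body = "Could not find #{page} in #{ARCHIVE}"
      end
    end

    trap('INT') { server.shutdown }
    server.start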
The problem is that 7zip was never optimized for quickly extracting one given file out of hundreds of thousands, or millions. Right now, the Indonesian Wikipedia (60 MB 7zipped) takes about 15 seconds per page on my two-year-old iBook, whereas the Chinese one (250 MB 7zipped) takes about 150 seconds per page. I haven't dared try any of the bigger ones, like the German (1.5 GB) or the English (four files of 1.5 GB each)... My first thought was to ask whether it would be possible to modify the open-source 7zip to generate an index of which block each file is in, which would make the actual extraction a lot faster. The problem is that my C is very weak, and I have been looking for people to help me, even offering a small bounty to whoever does it. (If anyone here would help, that would be MUCH appreciated! I personally think it would be quite easy given the source code that already exists, but I don't know for sure.)
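(If anyone wants to reproduce these numbers, timing a single-page extraction only takes a few lines of Ruby; the archive and page names below are just examples, not real paths in the dump:

    # Time how long 7z takes to decompress a single page from a solid archive.
    # Assumes 7z is on the PATH; archive and page names are examples only.
    require 'benchmark'

    archive = 'id.7z'                         # example archive
    page    = 'articles/j/a/k/Jakarta.html'   # example path inside the archive

    seconds = Benchmark.realtime do
      # -so writes the decompressed file to stdout; here we just throw it away.
      system('7z', 'e', '-so', archive, page, out: File::NULL, err: File::NULL)
    end

    puts format('Extracting %s took %.1f seconds', page, seconds)

As far as I can tell, that one 7z call is where all the time goes: with one big solid block, 7zip has to decompress the stream from the beginning until it reaches the requested file.)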
The 7zip developer himself suggested packing the Wikipedia dump files with something like this:

    7z a -mx -ms=32M -m0d=32M archive *

As I understand it, this limits each solid block to 32 MB, so that only the block containing the requested page has to be decompressed rather than everything before it, making the archive much faster to access. However, I really don't want to repack all the dump files myself (I cannot imagine how long it would take to rezip the 1.5 GB files), and I don't have the capacity to host them - my intention has all along been for my program to work out of the box with the official Wikipedia dump files... So I am writing here, since I don't know how else to contact the people who produce the dump files: is there any way they would consider using these options when creating the dumps, and if not, what are the reasons? (Maybe the files would get slightly bigger, but I think the benefit would far outweigh the disadvantage!)
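For what it's worth, the effect of those options is easy to check on one of the small wikis before committing to the big dumps: unpack an existing dump, repack it with the suggested settings, and time the same page against both archives. A rough sketch in Ruby (all file and page names are examples):

    # Rough comparison: repack a small dump with 32 MB solid blocks and time a
    # single-page extraction against both the original and the repacked archive.
    # Assumes 7z is on the PATH; all file and page names below are examples.
    require 'benchmark'

    ORIGINAL = 'id.7z'                         # example: original dump
    REPACKED = 'id-solid32m.7z'                # example: repacked copy
    PAGE     = 'articles/j/a/k/Jakarta.html'   # example page inside the dump

    # Unpack the original dump into a scratch directory, then repack it with
    # the options the 7zip developer suggested (shell form so * is expanded).
    system("7z x -oscratch #{ORIGINAL}") or abort('unpacking failed')
    system("cd scratch && 7z a -mx -ms=32M -m0d=32M ../#{REPACKED} *") or
      abort('repacking failed')

    [ORIGINAL, REPACKED].each do |archive|
      seconds = Benchmark.realtime do
        system('7z', 'e', '-so', archive, PAGE, out: File::NULL, err: File::NULL)
      end
      puts format('%-16s %.1f s per page', archive, seconds)
    end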
Anyway, any help, guidance, or ideas would be much appreciated. And feel free to play around with my program: it's very unfinished (and I have a better version which I will publish soon), but it's already quite functional, and it has gotten me through several long, boring meetings in fancy hotels with overpriced wifi :)
Thank you very much, and please let me know if there are other mailing lists or Wikipedia discussion pages that would be more appropriate for this question.
Stian in Jakarta