Dear all,
I thought I'd introduce myself first. I've been using Wikipedia for about four years, mostly the English one (user houshuang), but also occasionally the Norwegian (I am Norwegian), the Chinese, and lately the Indonesian (I currently live in Jakarta). I am also very interested in Wikipedia from a social and technological perspective.
Lately I've been working a lot on a way to use Wikipedia HTML dump files offline. I posted about this a while ago; the current version can be downloaded from http://houshuang.org/blog . I am working on a new version with a few improvements. It's working quite well right now (I have about 8 small language wikis on my HD, and all the interlanguage links work, etc.). The idea is that it works without unzipping the files first: you just place them in the right directory, and given that you have 7zip and Ruby, it should just work (serving pages to localhost). I will make a better installer, a tiny graphical UI, etc. later, so that I can put it on a CD with a given language file, and it will just work with one click on Mac, Windows, and Linux.
The problem is that 7zip was never optimized for quickly extracting one given file out of hundreds of thousands, or millions. Right now, the Indonesian Wikipedia (60 MB 7zipped) takes about 15 seconds per page on my two-year-old iBook, whereas the Chinese one (250 MB 7zipped) takes about 150 seconds per page. I haven't dared try any of the bigger ones, like the German (1.5 GB) or the English (four files of 1.5 GB each)... My first thought was whether it would be possible to modify the open-source 7zip to generate an index of which block the different files are in, which would then make the actual extraction a lot faster. The problem is that I suck at C, and I have been looking for people to help me, even offering a small bounty to the developer. (If anyone here would help me, that would be MUCH appreciated! I personally think it would be quite easy, given the source code that exists, but I don't know for sure.)
The developer himself suggested packing the Wikipedia dump file with something like this:

7z a -mx -ms=32M -m0d=32M archive *
which would make it more modular and much faster to access. However, I really don't want to repack all the dump files (I cannot imagine how long it would take to rezip the 1.5 GB file), and I don't have the capacity to host them - my intention has always been for my program to work out of the box with the Wikipedia dump files... So I am writing here, since I don't know how else to contact the dump file developers: is there any way they would consider using these options when making the dump files, and if not, what are the reasons? (Maybe the files would get slightly bigger, but I think the benefit would far outweigh the disadvantage!)
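For what it's worth, the idea behind smaller solid blocks plus an index can be sketched in a few lines of Ruby. This is only a toy model (using Zlib-compressed blocks as a stand-in for real 7z solid blocks, with made-up names): pages are compressed in small independent groups, and a lookup only decompresses the one group containing the requested page.

```ruby
require 'zlib'

# Toy model of solid-block compression: pages are grouped into
# independently compressed blocks, and an index records which
# block each page lives in. Looking up a page then only costs
# decompressing one block, not the whole archive.
BLOCK_SIZE = 3 # pages per block (tiny, for illustration)

def pack(pages)
  blocks = []
  index  = {} # page title => [block number, position in block]
  pages.each_slice(BLOCK_SIZE).with_index do |slice, b|
    slice.each_with_index { |(title, _), i| index[title] = [b, i] }
    blocks << Zlib::Deflate.deflate(Marshal.dump(slice))
  end
  [blocks, index]
end

def lookup(blocks, index, title)
  b, i = index[title]
  return nil unless b
  slice = Marshal.load(Zlib::Inflate.inflate(blocks[b]))
  slice[i][1] # the page body
end

pages = (1..10).map { |n| ["Page#{n}", "body #{n}"] }
blocks, index = pack(pages)
```

With ten pages and three pages per block, `lookup(blocks, index, "Page7")` only inflates the third block; the other three stay compressed.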
Anyway, any help, guidance, or ideas would be much appreciated. And feel free to play around with my program. It's very unfinished (and I have a better version which I will publish soon), but it's already quite functional, and has gotten me through several long, boring meetings in fancy hotels with overpriced wifi :)
Thank you very much, and please let me know if there are other mailing lists or Wikipedia discussion pages which would be more appropriate for this question.
Stian in Jakarta
Hi
How about using another algorithm that does this already?
-- chris
[...] My first thought was if it was possible to modify the open-source 7zip to generate an index of which block the different files are in, which would then make the actual extraction a lot faster. [...]
If I'm not mistaken, the 7z format has no index at the start of the file with pointers to where all the files start. Just use plain old zip files: they are larger, but they are great for picking out just one file.
Cheers, Peter.
On 2/27/07, Stian Haklev shaklev@gmail.com wrote: [...]
Hi Stian,
I was playing with your tool yesterday. In my experiments I found the access time for a page to be almost constant: it is dominated by the cost of uncompressing the index.
Since you launch the program every time, it needs to uncompress, store, and read the index over and over again.
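One quick mitigation (a sketch, with hypothetical file names and a stubbed-out index builder) would be to cache the parsed index on disk after the first run, so subsequent launches can load it directly instead of uncompressing it again:

```ruby
require 'tmpdir'

# Load a previously cached index if one exists; otherwise build it
# (stubbed out below) and cache it with Marshal for the next launch.
def load_index(cache_path)
  if File.exist?(cache_path)
    Marshal.load(File.binread(cache_path)) # fast path: skip decompression
  else
    index = build_index                    # slow path: scan the archive
    File.binwrite(cache_path, Marshal.dump(index))
    index
  end
end

# Stand-in for the expensive step of uncompressing and parsing
# the archive's file list.
def build_index
  { "Jakarta" => 0, "Oslo" => 1 }
end
```

The first call pays the full cost; every later launch just reads the Marshal dump.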
I modified 7z_C a bit to accept the same parameters as 7z, and found that it takes 4.2 seconds against 7-8.2 seconds for 7z, or 6.2 versus 7.2...
That is a big difference, but still too slow. It may have to do with the program being less general: more specific code means a smaller binary, and since the program has to be loaded by the OS so often, the smaller, the better.
My original idea was to modify it so it could mmap the file index. But taking into account that the index has lots of pointers, I'm now considering making it a full server (dropping the Ruby part) so it can keep the 7z data in memory.
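The full-server idea could look something like this minimal sketch (stdlib TCPServer, with an invented protocol: one page title per line in, one answer per line out). The point is simply that the index lives in the process and is built only once:

```ruby
require 'socket'

# Minimal persistent lookup server: the index is built once at
# startup and stays in memory, so each request is just a hash
# lookup instead of a fresh decompression of the archive index.
def start_server(index, port = 0)
  server = TCPServer.new('127.0.0.1', port)
  thread = Thread.new do
    loop do
      client = server.accept
      title  = client.gets.to_s.chomp
      client.puts(index.fetch(title, 'NOT FOUND'))
      client.close
    end
  end
  [server, thread]
end

def query(port, title)
  sock = TCPSocket.new('127.0.0.1', title ? port : port)
  sock.puts(title)
  reply = sock.gets.chomp
  sock.close
  reply
end
```

A real version would of course answer with page contents rather than index entries, but the startup-cost argument is the same.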
Opinions?
Stian Haklev wrote:
The developer himself suggested packing the Wikipedia dump file with something like this 7z a -mx -ms=32M -m0d=32M archive *
The static HTML dumps are already packed with -ms8m, a chunk size which I found during testing to give a good tradeoff between compression ratio and random access speed. But perhaps my testing was biased. You're not the first person to have complained about it. Reducing the chunk size to say 2-4MB might be a good move. But increasing it to 32MB would be a step in the wrong direction.
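The tradeoff described here is easy to demonstrate on toy data (again using Zlib blocks as a stand-in for 7z solid chunks, with invented sample pages): larger chunks compress better because redundancy across pages is exploited, but a random lookup then has to decompress more data.

```ruby
require 'zlib'

# Compress the same pages with different "chunk" sizes and compare
# total compressed size. Larger chunks let the compressor exploit
# redundancy between pages, at the cost of slower random access.
def packed_size(pages, pages_per_chunk)
  pages.each_slice(pages_per_chunk).sum do |chunk|
    Zlib::Deflate.deflate(chunk.join).bytesize
  end
end

pages = Array.new(100) { |n| "<html><body>Article #{n}, mostly boilerplate markup</body></html>" }
small = packed_size(pages, 1)   # one page per chunk: fast access, poor ratio
large = packed_size(pages, 100) # one big solid chunk: best ratio, slow access
```

Intermediate chunk sizes land between the two extremes, which is the tradeoff the 8 MB chunk size is balancing.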
I'm not sure what -mx is meant to do; the manual implies that option should be followed by a number. Presumably -m0d=32M is meant to set the dictionary size to 32MB. I'm not sure what good that would do when the chunk size is far below the default dictionary size. Less memory usage, perhaps?
-- Tim Starling