Stian Haklev wrote:
Dear all, thank you for your feedback. I have been sick for a few days, but I'm back :) And I just released 0.22 of the software, which is a bit more robust - the 7zip issue is still unresolved though (http://houshuang.org/blog/wikipedia-offline-server).
It has some kind of regression: with 0.21 I could do wiki-html.rb path/to/wiki.7z, but with 0.22 I can't. The server starts, but the dump is not loaded.
I tried looking at the 7zip code again, but it's still too complicated, with too many files involved, for me to understand anything. However, it makes all the sense in the world to me to build up an index of file names and blocks beforehand (possibly using 7za l -slt), then feed the block number to the 7z extractor and have it jump directly there (or, if it needs more specific information, make a lister that outputs that information and feed it that).
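For illustration, building that index from the listing could look roughly like this in Ruby (build_index is just a placeholder name, and the exact -slt field names may vary between 7-Zip versions):

def build_index(archive)
  index = {}
  path = nil
  # 7za l -slt prints one record per file with "Path = ..." and "Block = ..." lines
  IO.popen(['7za', 'l', '-slt', archive]) do |io|
    io.each_line do |line|
      case line
      when /^Path = (.+)/   then path = $1.strip
      when /^Block = (\d+)/ then index[path] = $1.to_i if path
      end
    end
  end
  index
end

# index = build_index('path/to/wiki.7z')
# index['some/article/path.html']  # => block number to jump to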
Platonides: I don't know if having 7z run constantly is a good idea.
While I am not wedded to Ruby, it makes it very easy to do regexps and add small features, and the built-in webserver is very nice (it includes threading)! However, if you got this to work, I'd love it. Personally I think it's all about building an index. By putting the 2 million filenames + block locations into, for example, an indexed SQLite3 database, it should be possible to extract any one of them in milliseconds. However, what 7zip does does not seem optimized - what I saw was a simple for-next loop over all the filenames (and this is done by looping over the data on disk, not over a model in memory)...
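As a rough sketch of that idea (using the sqlite3 gem; the database file, table and column names are just placeholders, and build_index is the hypothetical helper above):

require 'sqlite3'

db = SQLite3::Database.new('wiki-index.db')
db.execute('CREATE TABLE IF NOT EXISTS files (path TEXT PRIMARY KEY, block INTEGER)')

# Load the 2 million filename -> block mappings once, inside a transaction
db.transaction do
  build_index('path/to/wiki.7z').each do |path, block|
    db.execute('INSERT OR REPLACE INTO files VALUES (?, ?)', [path, block])
  end
end

# A single indexed lookup should then take milliseconds
block = db.get_first_value('SELECT block FROM files WHERE path = ?',
                           'some/article/path.html')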
Well, it isn't exactly 7z. It's a server which, instead of reading the files from the local filesystem, reads them from a "7z filesystem". By doing everything in the same process, it keeps the index open in memory - no need to cache block locations outside and pass them back and forth. In fact, it isn't caching anything. The article never goes through the filesystem, which fixes the bug with non-ASCII characters. It also avoids the CGI wrapping of articles and thus the URL rewrites.
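The same in-process idea, sketched very roughly in Ruby with WEBrick (this is only an illustration, not the actual code; build_index is the hypothetical helper from above, and it shells out to 7z instead of seeking into the archive directly):

require 'webrick'

ARCHIVE = ARGV[0] || 'wiki.7z'
index = build_index(ARCHIVE)   # filename -> block, kept in memory

server = WEBrick::HTTPServer.new(:Port => 8000)
server.mount_proc('/') do |req, res|
  path = req.path.sub(%r{\A/}, '')
  if index.key?(path)
    # Extract this one member to stdout; a real implementation would seek
    # straight to the block recorded in index[path] instead of rescanning.
    res.body = IO.popen(['7z', 'x', '-so', ARCHIVE, path]) { |io| io.read }
    res['Content-Type'] = 'text/html'
  else
    res.status = 404
  end
end
trap('INT') { server.shutdown }
server.start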
Efficiency is quite good. Reading everything from the 7z file, I get a page in about 2 seconds. And there are a lot of requests per page (favicon, article, skin, and one for each image).
With proper rewriting, I even got images working! ...provided that you are online, of course. You could point the images to an intermediate proxy on your LAN.
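One way to do that rewriting, with a made-up proxy address (the path scheme on the proxy side is also just an assumption):

def rewrite_image_urls(html, proxy = 'http://192.168.1.10:3128')
  # Send absolute upload.wikimedia.org URLs through a caching proxy on the LAN
  html.gsub('http://upload.wikimedia.org', "#{proxy}/upload.wikimedia.org")
end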
The main problem I found is that it can take quite a lot of memory (some buffer of uncompressed data, I guess). HTTP handling is very basic, but good enough for the intended use.
If you want to try it out, drop me a line and I'll send you the source code / the binary for Windows/Linux. I'm still thinking about where to publish it. I don't know whether Brion would like having it on MediaWiki's SVN.
------
While doing this I also found some bugs in the static dumps:
- Articles contain calls to w/extensions/wikihiero/img/hiero_XXX.png, but the extension images are not in the 7z. Probably worth adding them for the 'pedias.
- Paths of Commons images are redirected to /upload/shared/..., but images included from Commons description pages have an absolute path http://upload.wikimedia.org/wikipedia/commons/... (see Image:PD-icon.svg for an example).
- TeX images point to a math/ folder, which is not included in the 7z. Any hope of it being added?
- The included /skins/common/wikibits.js uses wgBreakFrames, but articles don't include the JavaScript properties. Was it added while making the dump?
Note: I'm using a December 2006 dump.