Stian Haklev wrote:
Dear all, thank you for your feedback. I have been sick for a few days, but I'm back :) And I just released 0.22 of the software, which is a bit more robust - the 7zip issue is still unresolved though (http://houshuang.org/blog/wikipedia-offline-server).
It has some kind of regression: with 0.21 I could do wiki-html.rb path/to/wiki.7z, but with 0.22 I can't. The server starts, but the dump is not loaded.
I tried looking at the 7zip code again, but it's still too complicated, with too many files involved, for me to understand anything. However, it makes all the sense in the world to me to build up an index of file names and blocks beforehand (possibly using 7za l -slt), then feed the block number to the 7z extractor and have it jump directly there (or, if it needs more specific information, make a lister that outputs that information and feed it that).
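For illustration, building that index from the listing could look roughly like this in Ruby (build_index is just a placeholder name, and the exact -slt field names may vary between 7-Zip versions):

def build_index(archive)
  index = {}
  path = nil
  # 7za l -slt prints one record per file with "Path = ..." and "Block = ..." lines
  IO.popen(['7za', 'l', '-slt', archive]) do |io|
    io.each_line do |line|
      case line
      when /^Path = (.+)/   then path = $1.strip
      when /^Block = (\d+)/ then index[path] = $1.to_i if path
      end
    end
  end
  index
end

# index = build_index('path/to/wiki.7z')
# index['some/article/path.html']  # => block number to jump to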
Platonides: I don't know if having 7z run constantly is a good idea.
While I am not wedded to Ruby, it makes it very easy to do regexps and add small features, and the built-in webserver is very nice (it includes threading)! However, if you got this to work, I'd love it. Personally I think it's all about building an index. By putting the 2 million filenames + block locations into, for example, an indexed SQLite3 database, it should be possible to extract any one of them in milliseconds. However, what 7zip does does not seem optimized - what I saw was a simple for-next loop over all the filenames (and this is done by looping over the data on disk, not over a model in memory)...
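As a rough sketch of that idea (using the sqlite3 gem; the database file, table and column names are just placeholders, and build_index is the hypothetical helper above):

require 'sqlite3'

db = SQLite3::Database.new('wiki-index.db')
db.execute('CREATE TABLE IF NOT EXISTS files (path TEXT PRIMARY KEY, block INTEGER)')

# Load the 2 million filename -> block mappings once, inside a transaction
db.transaction do
  build_index('path/to/wiki.7z').each do |path, block|
    db.execute('INSERT OR REPLACE INTO files VALUES (?, ?)', [path, block])
  end
end

# A single indexed lookup should then take milliseconds
block = db.get_first_value('SELECT block FROM files WHERE path = ?',
                           'some/article/path.html')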
Well, it isn't exactly 7z. It's a server which, instead of reading the files from the local filesystem, reads them from a "7z filesystem". By doing everything in the same process, it keeps the index open in memory - no need to cache block locations outside and pass them back and forth. In fact, it isn't caching anything. The article never goes through the filesystem, which fixes the bug with non-ASCII characters. It also avoids the CGI wrapping of articles and thus the URL rewrites.
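The same in-process idea, sketched very roughly in Ruby with WEBrick (this is only an illustration, not the actual code; build_index is the hypothetical helper from above, and it shells out to 7z instead of seeking into the archive directly):

require 'webrick'

ARCHIVE = ARGV[0] || 'wiki.7z'
index = build_index(ARCHIVE)   # filename -> block, kept in memory

server = WEBrick::HTTPServer.new(:Port => 8000)
server.mount_proc('/') do |req, res|
  path = req.path.sub(%r{\A/}, '')
  if index.key?(path)
    # Extract this one member to stdout; a real implementation would seek
    # straight to the block recorded in index[path] instead of rescanning.
    res.body = IO.popen(['7z', 'x', '-so', ARCHIVE, path]) { |io| io.read }
    res['Content-Type'] = 'text/html'
  else
    res.status = 404
  end
end
trap('INT') { server.shutdown }
server.start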
Efficiency is quite good. Reading everything from the 7z file, I get a page in about 2 seconds. And there are a lot of requests per page (favicon, article, skin, and one for each image).
With proper rewriting, I even got images working! ...provided that you are online, of course. You could point the images to an intermediate proxy on your LAN.
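One way to do that rewriting, with a made-up proxy address (the path scheme on the proxy side is also just an assumption):

def rewrite_image_urls(html, proxy = 'http://192.168.1.10:3128')
  # Send absolute upload.wikimedia.org URLs through a caching proxy on the LAN
  html.gsub('http://upload.wikimedia.org', "#{proxy}/upload.wikimedia.org")
end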
The main problem I found is that it can take quite a lot of memory (some buffer of uncompressed data, I guess). HTTP handling is very basic, but good enough for the intended use.
If you want to try it out, drop me a line and I'll send you the source code / the binary for Windows/Linux. I'm still thinking about where to publish it. I don't know whether Brion would like having it on MediaWiki's SVN.
------
While doing this I also found some bugs in the static dumps:
- Articles contain calls to w/extensions/wikihiero/img/hiero_XXX.png, but the extension images are not in the 7z. Probably worth adding them for the 'pedias.
- Paths of Commons images are redirected to /upload/shared/..., but images included from Commons description pages have an absolute path http://upload.wikimedia.org/wikipedia/commons/... (see Image:PD-icon.svg for an example).
- TeX images point to a math/ folder, which is not included in the 7z. Any hope of it being added?
- The included /skins/common/wikibits.js uses wgBreakFrames, but articles don't include the JavaScript properties. Was it added while making the dump?
Note: I'm using a December 2006 dump.