I'm hereby presenting a new offline reader for MediaWiki. I already presented it last year on another mailing list; here it is explained in depth for the technical community.
The impatient can go directly to http://www.wiki-web.es/mediawiki-offline-reader/ and download some samples.
Note: Almost all the docs are in Spanish only, so if you have a special interest in some of them, or don't mind them not being in English, just ask. :-)
The application works from the wikitext found in the XML dumps, and bundles MediaWiki to parse it. Unlike other existing programs [1], MediaWiki isn't run behind a web server: the whole app is written in PHP, using PHP-GTK [2]. MediaWiki is run as a subquery, using the runkit extension [3]. The resulting HTML is fed back to GtkIEEmbed [4] for display.
The file format is the same XML as the dumps. The problem of bzip2 files being too big [5] is overcome by starting a separate Burrows-Wheeler block every N articles (in fact, for simplicity I'm using full bzip2 files concatenated together). This is similar to the (independent) approach taken by Thanassis Tsiodras (ttsiod) in [6]. However, instead of splitting the existing files, the dump is recompressed in chunks of N articles, and an index is created at the same time.
The same path is taken for categories, which are stored in a new category XML dump that combines the information spread across page, category, categorylinks and page_props. See bug 16176 [7] for more information, although my implementation can probably be improved.
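To make the block layout described above more concrete, here is a minimal sketch in Python (not the PHP code of the reader itself). It assumes an in-memory list of (title, xml) pairs and a plain dictionary as the index; the names write_blocks and read_block are made up for illustration, and the real index format may well differ. Each group of N articles becomes one complete bzip2 stream, so a reader only has to decompress the single block that holds the wanted article.

```python
import bz2
import io


def write_blocks(articles, per_block, out):
    """Compress (title, xml) pairs in groups of per_block articles.

    Each group is written as one complete bzip2 stream. Returns a
    hypothetical index mapping title -> (offset, compressed_length).
    """
    index = {}
    for start in range(0, len(articles), per_block):
        group = articles[start:start + per_block]
        payload = "".join(xml for _, xml in group).encode("utf-8")
        compressed = bz2.compress(payload)
        offset = out.tell()  # where this block starts in the output
        out.write(compressed)
        for title, _ in group:
            index[title] = (offset, len(compressed))
    return index


def read_block(src, index, title):
    """Decompress only the block that contains the given title."""
    offset, length = index[title]
    src.seek(offset)
    return bz2.decompress(src.read(length)).decode("utf-8")


# Tiny demonstration with made-up articles and N = 2.
demo = [
    ("A", "<page><title>A</title></page>"),
    ("B", "<page><title>B</title></page>"),
    ("C", "<page><title>C</title></page>"),
]
archive = io.BytesIO()
demo_index = write_blocks(demo, 2, archive)
print(read_block(archive, demo_index, "C"))
```

Because the output is just bzip2 streams placed back to back, it can still be decompressed as a whole by tools that accept concatenated bzip2 files; the index only adds the ability to jump straight to one block.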
Instead of hacking MediaWiki at a high level (Article, Title...), which could be a use case for the recently suggested wikiNeedL, the reader is plugged in as a database driver [8] that applies some regexes to the SQL, as in [5].
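As a rough illustration of the "regex on the SQL" idea, the following Python sketch recognises one query shape and extracts what a dump-backed driver would need to look up. It is not taken from DatabaseIndexedXml.php: the function name route_query, the single pattern, and the exact shape of the query are assumptions. Only the page table and its page_namespace and page_title columns come from the MediaWiki schema.

```python
import re

# Hypothetical pattern for a page lookup by namespace and title.
PAGE_LOOKUP = re.compile(
    r"SELECT\s+.*?\s+FROM\s+`?page`?\s+WHERE\s+"
    r"page_namespace\s*=\s*'?(\d+)'?\s+AND\s+"
    r"page_title\s*=\s*'((?:[^'\\]|\\.)*)'",
    re.IGNORECASE | re.DOTALL,
)


def route_query(sql):
    """Return ("page", namespace, title) for a recognised page lookup.

    Anything else is reported as unsupported, so the caller can return
    an empty result set instead of touching a real database.
    """
    match = PAGE_LOOKUP.search(sql)
    if match:
        return ("page", int(match.group(1)), match.group(2))
    return ("unsupported", None, None)


sample = (
    "SELECT page_id, page_latest FROM `page` "
    "WHERE page_namespace = '0' AND page_title = 'Main_Page' LIMIT 1"
)
print(route_query(sample))
```

A driver built this way would answer the recognised queries from the indexed dump and treat writes and unknown queries as no-ops.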
Limitations:
* The parser is slow, as almost all the caching layers are disabled. Pages with many links or templates cause many accesses to fetch the referenced pages. This is especially noticeable on first load, as Main Pages are usually complex and the interface appears frozen (MediaWiki runs on the same thread as the message loop).
* Windows-only for now (the only dependency is IE, and there are assumptions based on its protocol behavior).
* Only a few special pages work (most noticeably, Special:Search is missing, so there is no full-text search).
* It doesn't show the full author list.
Sweet and happy new year!
Ángel González
1 - Magnus wp_de_2004_05, Standalonewiki, YAWR, Tntreader...
2 - http://gtk.php.net/
3 - http://pecl.php.net/package/runkit
4 - http://live.gnome.org/GtkIEEmbed
5 - http://permalink.gmane.org/gmane.science.linguistics.wikipedia.technical/367...
6 - http://users.softlab.ece.ntua.gr/~ttsiod/buildWikipediaOffline.html
7 - https://bugzilla.wikimedia.org/show_bug.cgi?id=16176
8 - Files DatabaseIndexedXml.php, DumpIndexSearcher.php, {Article,Category}Fetcher.php