I'm presenting a new offline reader for MediaWiki. It was already
announced last year on another mailing list; here it is explained in
more depth, for the technical community.

The impatient can go directly to
http://www.wiki-web.es/mediawiki-offline-reader/
and download some samples.

Note: Almost all the docs are in Spanish only, so if you have a special
interest in some of them, or don't mind them not being in English, just
ask. :-)

The application works from the wiki text found in XML dumps, and
includes MediaWiki to parse it. Unlike other existing programs [1],
MediaWiki isn't run behind a web server: the whole app is written in
PHP, using PHP-GTK [2]. MediaWiki is run as a subquery, using the runkit
extension [3]. The HTML result is fed back to GtkIEEmbed [4] for
display.

The file format is the same XML as the dumps, and the problem of bzip
files being too big [5] is overcome by using separate Burrows-Wheeler
blocks (in fact, for simplicity, I'm using full bzip2 files
concatenated), one every N articles. This is similar to the
(independent) approach taken by Thanassis Tsiodras (ttsiod) in [6].
However, instead of splitting the existing files, the dump is
recompressed in blocks of N articles, and an index is created at the
same time.
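
To illustrate the block layout described above, here is a minimal
sketch in Python (not the PHP code of the application). It writes one
independent bzip2 stream per N articles and records, for each title,
the byte offset of its block. The function names, the simplified
<page> markup and the in-memory index are assumptions made for the
example; the real index format is not described here.

```python
import bz2
import io
from xml.sax.saxutils import escape

PAGE_END = "</page>\n"

def write_blocks(articles, out, block_size):
    """Write articles as concatenated bzip2 streams, one per block.

    articles: list of (title, text) pairs.
    Returns a dict mapping title -> (byte offset of block, position in block).
    """
    index = {}
    for start in range(0, len(articles), block_size):
        block = articles[start:start + block_size]
        offset = out.tell()  # where this block's bzip2 stream begins
        for pos, (title, _text) in enumerate(block):
            index[title] = (offset, pos)
        payload = "".join(
            "<page><title>%s</title><text>%s</text>%s"
            % (escape(title), escape(text), PAGE_END)
            for title, text in block
        )
        # Each block is a complete bzip2 stream, so it can be
        # decompressed without reading anything that precedes it.
        out.write(bz2.compress(payload.encode("utf-8")))
    return index

def read_article(f, index, title):
    """Seek to the block holding `title` and decompress only that block."""
    offset, pos = index[title]
    f.seek(offset)
    decomp = bz2.BZ2Decompressor()
    chunks = []
    while not decomp.eof:  # stops at the end of this one stream
        data = f.read(4096)
        if not data:
            break
        chunks.append(decomp.decompress(data))
    pages = b"".join(chunks).decode("utf-8").split(PAGE_END)
    return pages[pos] + PAGE_END.strip()

if __name__ == "__main__":
    demo = [("A", "alpha"), ("B", "beta"), ("C", "gamma")]
    buf = io.BytesIO()
    idx = write_blocks(demo, buf, block_size=2)
    print(read_article(buf, idx, "C"))
```

Because the output is just bzip2 streams placed back to back, tools
that understand multi-stream files can still decompress it as a whole;
a smaller N means faster lookups but a worse compression ratio.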

The same path is taken for categories, which are stored in a new
category XML dump that combines the information spread across the page,
category, categorylinks and page_props tables. See bug 16176 [7] for
more information, although my implementation can probably be improved.

Instead of hacking MediaWiki at a high level (Article, Title...; this
could be a use case for the recently suggested wikiNeedL), the reader is
incorporated as a database driver [8], which applies some regexes to the
SQL as in [5].
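
The idea of a driver that recognizes queries by regex can be sketched
as follows, again in Python rather than PHP. This is only an
illustration: the class name, the SQL pattern and the fetch_page
callback are hypothetical, and they are not taken from
DatabaseIndexedXml.php or from the exact SQL that MediaWiki emits.

```python
import re

# Hypothetical pattern for a page lookup by namespace and title.
PAGE_LOOKUP = re.compile(
    r"SELECT\s+.+?\s+FROM\s+`?page`?\s+WHERE\s+"
    r"page_namespace\s*=\s*'?(?P<ns>\d+)'?\s+AND\s+"
    r"page_title\s*=\s*'(?P<title>(?:[^'\\]|\\.)*)'",
    re.IGNORECASE | re.DOTALL,
)

class IndexedDumpDriver:
    """Pretends to be a database, but answers from an indexed dump."""

    def __init__(self, fetch_page):
        # fetch_page(namespace, title) -> row dict, or None if missing
        self.fetch_page = fetch_page

    def query(self, sql):
        match = PAGE_LOOKUP.search(sql)
        if match:
            # Undo backslash escaping of the quoted SQL string.
            title = re.sub(r"\\(.)", r"\1", match.group("title"))
            row = self.fetch_page(int(match.group("ns")), title)
            return [row] if row is not None else []
        # Anything unrecognized (writes, statistics...) yields no rows.
        return []

if __name__ == "__main__":
    pages = {(0, "Main_Page"): {"page_id": 1, "page_title": "Main_Page"}}
    driver = IndexedDumpDriver(lambda ns, title: pages.get((ns, title)))
    print(driver.query(
        "SELECT page_id FROM `page` WHERE page_namespace = '0' "
        "AND page_title = 'Main_Page' LIMIT 1"
    ))
```

The appeal of this design is that the calling code keeps issuing its
usual SQL and never learns that no real database is behind it; the
cost is that every query shape that matters needs its own pattern.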

Limitations:
* The parser is slow, as almost all the caching layers are disabled.
  Pages with many links/templates produce many accesses to fetch the
  pages. This is especially noticeable on first load, as Main Pages are
  usually complex and the interface looks frozen (MediaWiki runs on the
  same thread as the message loop).
* Windows-only for now (the only dependency is IE; there are
  assumptions based on its protocol behavior).
* Only a few special pages are working (the most noticeable gap is the
  lack of Special:Search for full-text search).
* It doesn't show the full author list.
Sweet and happy new year!
Ángel González
1 - Magnus wp_de_2004_05, Standalonewiki, YAWR, Tntreader...
2 - http://gtk.php.net/
3 - http://pecl.php.net/package/runkit
4 - http://live.gnome.org/GtkIEEmbed
5 - http://permalink.gmane.org/gmane.science.linguistics.wikipedia.technical/36…
6 - http://users.softlab.ece.ntua.gr/~ttsiod/buildWikipediaOffline.html
7 - https://bugzilla.wikimedia.org/show_bug.cgi?id=16176
8 - Files DatabaseIndexedXml.php, DumpIndexSearcher.php,
    {Article,Category}Fetcher.php