Brion Vibber wrote:
Tim's started a script for this; it's in the maintenance directory in CVS. It's in the development version of MediaWiki and is not yet directly usable with current page dumps, as the database format has changed.
I guess I'd better say a few words about it, since the topic keeps coming up. The script produces quite nice HTML dumps, with the HTML files distributed among directories identified by the first two bytes of the page title. The link URLs are rewritten appropriately. This is useful for English wikis since it allows you to guess the URL, but it doesn't work so well for languages with a different orthography. Doing it by character rather than by byte could be useful; however, that would give a much broader tree, especially for the CJK languages.
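To make the bucketing scheme concrete, here is a minimal sketch of the idea in Python. The function name and the exact directory layout are assumptions for illustration, not the script's actual code; it also shows why bucketing by byte is awkward for non-Latin titles.

```python
def dump_path(title: str) -> str:
    """Map a page title to a dump file path, bucketing by the first
    one and two bytes of the UTF-8 title.

    Illustrative only: the real script's naming and layout may differ.
    """
    name = title.replace(" ", "_")
    raw = name.encode("utf-8")
    # Directory levels from the first one and two bytes of the title.
    # For a multi-byte (e.g. CJK) first character, raw[:1] is a partial
    # UTF-8 sequence, so the directory name degenerates -- this is the
    # byte-vs-character problem mentioned above.
    level1 = raw[:1].decode("utf-8", "replace")
    level2 = raw[:2].decode("utf-8", "replace")
    return f"{level1}/{level2}/{name}.html"
```

For an English wiki this gives guessable paths like A/Ap/Apple.html; bucketing by character instead would keep CJK directory names readable, at the cost of a much broader tree.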
It's currently clumsy to use, requiring you to move HTMLDump.php from skins/disabled/ to skins/, which unfortunately enables the skin in the user interface. Any user who changed their skin to it would find that there are no user preference links allowing them to get back (the interface is greatly stripped down). We need to make a distinction between valid skins and skins available for users.
It rewrites stylesheet and image URLs to relative URLs with hard-coded paths (../../../images and ../../../skins). This needs to be made more flexible. It doesn't currently rewrite URLs for images from Commons, or provide a way to package those images.
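A more flexible version would derive the ../ prefix from the page's depth in the tree rather than hard-coding it. A minimal sketch, assuming a top-level directory above the two bucket levels (so pages sit three directories deep, matching the ../../../ in the hard-coded paths):

```python
def relative_url(from_page: str, to_asset: str) -> str:
    """Build a relative URL from a dumped page to a shared asset.

    Illustrative sketch: counts the directories between the page and
    the dump root and prepends that many "../" segments, instead of
    hard-coding ../../../skins and ../../../images.
    """
    depth = from_page.count("/")  # directories between page and root
    return "../" * depth + to_asset
```

So a page at articles/A/Ap/Apple.html would link to ../../../skins/main.css, and the same code keeps working if the tree ever becomes shallower or deeper.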
So it's a start, but there are still plenty of things to do. Luckily, porting the parser to a different language isn't one of them. I'm not working on it at the moment, so I won't mind if someone picks it up where I left off.
-- Tim Starling