On Sun, Jul 19, 2009 at 5:23 AM, Chengbin Zheng <chengbinzheng@gmail.com> wrote:
Since the static HTML Wikipedia is not updating (please update), and XML updates like everyday, the logical choice is to go with XML. Is there any way to convert XML to HTML, like the static HTML version? I need it in HTML, and I don't want a one year old version of Wikipedia, with all the useless information on user talk, discussions, etc. Thank you.
There are plenty of options for converting the XML (or just the MediaWiki markup) to HTML, such as:
- http://sourceforge.net/apps/mediawiki/wikiprep/index.php?title=Main_Page (the parser is decent, but there is currently no real full-featured HTML export)
- http://wiki.laptop.org/go/Wiki_Slice (but not using XML as source, just stripping down output using ?action=raw)
- https://projects.fslab.de/projects/wpofflineclient/wiki/Specifications (but also using the raw action)
(a nice article on how to build a static version of Wikipedia: http://users.softlab.ece.ntua.gr/~ttsiod/buildWikipediaOffline.html)
There is also a nice list of all the available parsers (usually from MediaWiki markup to something else):
http://www.mediawiki.org/wiki/Alternative_parsers
Regarding the XML format: you usually scan the XML for the start of <page> and the matching </page> to get one page, then look inside it for the <text> element, which holds the raw page in MediaWiki markup. You can then use any of the existing MediaWiki markup parsers, as long as you have extracted the latest revision of the page in MediaWiki markup.
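As a rough illustration of that <page>/<text> extraction, here is a minimal Python sketch using the standard library's streaming parser. The names iter_pages, local_name and SAMPLE are made up for this example, and the sample document is only a guess at the dump layout (the namespace URI in particular differs between dump versions, which is why the code ignores it). It assumes a dump with one revision per page; if several revisions are present, it simply keeps the last <text> in document order.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(source):
    """Yield (title, wikitext) for each <page> element in the dump.

    `source` is a filename or a binary file object. The dump is read
    as a stream, so it is not loaded into memory as one big tree.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        title = None
        text = ""
        for child in elem.iter():
            name = local_name(child.tag)
            if name == "title":
                title = child.text
            elif name == "text":
                # Assumption: the last <text> seen is the revision we want.
                text = child.text or ""
        yield title, text
        elem.clear()  # drop this page's children once it has been consumed


# Hypothetical, heavily trimmed sample standing in for a real dump file.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.3/">
  <page>
    <title>Example</title>
    <revision>
      <id>1</id>
      <text xml:space="preserve">'''Example''' is a [[test]] page.</text>
    </revision>
  </page>
</mediawiki>"""

if __name__ == "__main__":
    for title, wikitext in iter_pages(io.BytesIO(SAMPLE)):
        print(title)
        print(wikitext)
```

The wikitext string returned for each page is what you would then hand to one of the markup parsers listed above to get HTML.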
Hope this helps,
adulau