I finally got around to working on the XML parser that has been promised for so long. Good news first: it already works well enough to render real wiki pages (in CVS HEAD).
Bad news: it is still incredibly un-debugged and incomplete.
For those of you who think "tell me when it works perfectly", you can stop reading now ;-)
So, whatdoyaneed to test it? You'll need the "flexbisonparse" module from CVS, which contains Timwi's flex/bison stuff. I haven't actually worked on the bison files so far. I couldn't get it to "make" on my Linux box (some Zend libs not found or something), so I wrote a "Makefile.cli", which you should rename to "Makefile". It will "make" a command-line parser that converts Wikipedia markup to XML: you pipe the wiki markup in and get the XML out.
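To illustrate the "pipe in, get XML out" usage from a script, here is a minimal Python sketch. The executable name "./wiki2xml" is only my assumption for whatever the Makefile produces on your box, and the function name is made up for this example; adjust both as needed.

```python
import subprocess

def wiki_to_xml(markup, parser_cmd=("./wiki2xml",)):
    """Pipe wiki markup into a command-line parser and return its output.

    parser_cmd is an assumption: replace it with the path of the binary
    that "make" actually builds on your system.
    """
    result = subprocess.run(
        list(parser_cmd),
        input=markup.encode("utf-8"),  # markup goes to the parser's stdin
        stdout=subprocess.PIPE,        # XML comes back on stdout
        check=True,                    # raise if the parser exits non-zero
    )
    return result.stdout.decode("utf-8")
```

Calling wiki_to_xml("''some'' [[wiki]] text") should then return whatever XML the parser prints.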
Then, follow the three-line instructions at the top of ParserXML.php. It should work now.
The output will look strange, because it contains three copies of the article text (the rendered XHTML, the "dumped" XML, and a structured XML tree) as debug information. You can turn the debug information off by editing the very end of the ParserXML.php file (you'll see where).
As I had some trouble passing the wiki markup to the command-line version of the wiki2xml parser, I currently create a temporary file, pipe that into the parser, catch the output, and remove the temporary file again. I am aware that this is incredibly ugly, and that Timwi's default makefile, which creates a shared PHP object and passes the data through it, is a lot cooler (and faster) than mine. However:
1. As I said, it didn't compile on my box, so I guess I won't be the only one; compiling the CLI version should work everywhere.
2. The shared object thingy limits the use to MediaWiki.
My thoughts on #2: I hope that the wiki2xml parser will be beneficial to many projects. One thing I thought of is a C/C++ program that can directly access an SQL dump, read the articles, have them parsed to XML, and then write them out as whatever you like: XHTML (static versions), PDF (WikiReader), DigiBib format (another XML format for the German CD).
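Since a stand-alone command-line parser can be called from any language, here is a rough Python sketch of the temporary-file round trip described above (write the markup to a temp file, feed that file to the parser, catch the output, remove the file). This is not the code in ParserXML.php; "./wiki2xml" and the function name are assumptions for illustration only.

```python
import os
import subprocess
import tempfile

def wiki_to_xml_via_tempfile(markup, parser_cmd=("./wiki2xml",)):
    """Temp-file round trip: write markup, feed it to the parser, clean up.

    parser_cmd is an assumed name for the command-line parser binary.
    """
    # 1. Create the temporary file and write the wiki markup into it.
    fd, path = tempfile.mkstemp(suffix=".wiki")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            tmp.write(markup)
        # 2. Use the file as the parser's stdin and 3. catch its stdout.
        with open(path, "rb") as src:
            result = subprocess.run(
                list(parser_cmd),
                stdin=src,
                stdout=subprocess.PIPE,
                check=True,
            )
        return result.stdout.decode("utf-8")
    finally:
        # 4. Remove the temporary file again, even if the parser failed.
        os.remove(path)
```

The try/finally is there so that a crashing parser does not leave stray temp files lying around.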
As for the XML2XHTML parser itself: basic wiki markup, tables, links, images, basic HTML, and nowiki are working. *Not* working are template inclusion and things I didn't think of ;-)
I probably won't get much more done this week. But, I'll be at the Berlin conference, so we can talk about this, or even have a hacking session...
Magnus
wikitech-l@lists.wikimedia.org