XML parser: Good news, everyone! - Wikitech-l

20 Dec 2004


I finally got around to working on the XML parser that has been promised 
for so long. Good news first: it already works well enough to render real 
wiki pages (in CVS HEAD).
Bad news: it is still incredibly un-debugged and incomplete.
For those of you who think "tell me when it works perfectly", you can 
stop reading now ;-)
So, what do you need to test it? You'll need the "flexbisonparse" module 
from CVS, which contains Timwi's flex/bison stuff. I haven't actually worked 
on the bison files so far. I couldn't get it to "make" on my Linux box 
(some Zend libs not found or something), so I wrote a "Makefile.cli", 
which you should rename to "Makefile". It will "make" a command-line 
parser that converts Wikipedia markup to XML: you pipe the wiki 
markup in and get the XML out.
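A minimal sketch of what that piping looks like from a script, in Python. The path ./wiki2xml is an assumption (use whatever binary name Makefile.cli actually produces), and so are the UTF-8 encoding and the idea that the parser takes no arguments.

```python
import subprocess

# Hypothetical path: use whatever binary name Makefile.cli actually produces.
WIKI2XML_CMD = ["./wiki2xml"]


def wiki_to_xml(markup, cmd=WIKI2XML_CMD):
    """Pipe wiki markup to a command's stdin and return what it writes to stdout."""
    result = subprocess.run(
        cmd,
        input=markup.encode("utf-8"),  # assumption: the parser expects UTF-8
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    return result.stdout.decode("utf-8")
```

Calling wiki_to_xml("'''bold''' text") would then return whatever XML the parser prints for that markup.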
Then, follow the three-line instructions at the top of ParserXML.php. It 
should work now.
The output will look strange, as it produces three copies of the article 
text (the rendered XHTML, the "dumped" XML, and a structured XML tree) 
as debug information. You can turn the debug information off by editing 
the very end of the ParserXML.php file (you'll see where).
As I had some trouble passing the wiki markup to the command-line 
version of the wiki2xml parser, I currently create a temporary file, 
pipe that into the parser, catch the output, and remove the temporary 
file again. I am aware that this is incredibly ugly, and that Timwi's 
default makefile, creating a shared PHP object and passing the data 
through there, is a lot cooler (and faster) than mine. However:
1. As I said, it didn't compile on my box, and I guess I won't be the 
only one; compiling the CLI version should work everywhere
2. The shared object thingy limits the use to MediaWiki
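The temporary-file workaround described above amounts to roughly the following, shown here as a Python sketch and not as the actual PHP in ParserXML.php. The binary path ./wiki2xml, the ".wiki" suffix, and the UTF-8 encoding are all assumptions.

```python
import os
import subprocess
import tempfile

# Hypothetical path: use whatever binary name Makefile.cli actually produces.
WIKI2XML_CMD = ["./wiki2xml"]


def wiki_to_xml_via_tempfile(markup, cmd=WIKI2XML_CMD):
    """Write markup to a temp file, feed that file to the parser, then clean up."""
    fd, path = tempfile.mkstemp(suffix=".wiki")  # suffix is an arbitrary choice
    try:
        # Step 1: create the temporary file holding the wiki markup.
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(markup.encode("utf-8"))
        # Step 2: use the file as the parser's stdin and catch its output.
        with open(path, "rb") as src:
            result = subprocess.run(
                cmd, stdin=src, stdout=subprocess.PIPE, check=True
            )
        return result.stdout.decode("utf-8")
    finally:
        # Step 3: remove the temporary file again, even if the parser failed.
        os.remove(path)
```

The extra disk round trip is what makes this slower than handing the data to a shared object in-process, but it needs nothing beyond the command-line binary.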
My thoughts on #2: I hope that the wiki2xml parser will be beneficial to 
many projects. One thing I thought of is a C/C++ program that can 
directly access an SQL dump, read the articles, have them parsed to XML, 
and then written as whatever-you-like: XHTML (static versions), PDF 
(WikiReader), DigiBib format (another XML format for the German CD).
As for the XML2XHTML parser itself: basic wiki markup, tables, links, 
images, basic HTML, and nowiki are working. *Not* working yet are template 
inclusion and anything I didn't think of ;-)
I probably won't get much more done this week. But, I'll be at the 
Berlin conference, so we can talk about this, or even have a hacking 
session...
Magnus