Pedro Medeiros wrote:
Speaking of which, why have a flex/bison parser? Wouldn't it be better if MediaWiki created XML pages directly, like an "atom feed" or "RSS" button? MediaWiki already carries an HTML engine for rendering wikitext to HTML; wouldn't it be easy, with little modification, to make it output XML (or even DocBook/XML) instead of HTML?
Our present "parser" is a hack built from a series of regexps and other horrors, whose steps often stomp on each other and produce hard-to-fix errors. It's not something to be emulated; rather, it is our greatest shame. Currently we cannot guarantee that the XHTML output will be well-formed, so changing it to a custom XML format would be a waste of time: output that isn't well-formed isn't transformable.
A character-by-character parser that makes a single pass from beginning to end and emits output that is guaranteed to be well-formed should be less error-prone and easier to maintain. Whether flex/bison is the best route I cannot say, but it's worth exploring.
Having this parser output an internal XML format instead of XHTML directly means a) we can maintain semantic information that would be lost in HTML and b) we can keep the base _parser_ separate from the code that does things like check for page existence, format the URLs for local links, and perhaps handle template transclusion. This allows transformation to other formats (XHTML, DocBook?) with less crap than, e.g., trying to rewrite all the HTML into DocBook.
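As a sketch of what such an intermediate format might preserve (element names here are invented for illustration, not a proposal): a wiki link like [[Main Page|the main page]] could be recorded with its target intact, rather than as an already-resolved anchor:

```xml
<!-- Hypothetical intermediate form for: See [[Main Page|the main page]]. -->
<paragraph>
  See <link target="Main Page">the main page</link>.
</paragraph>
```

A later XHTML pass would check whether Main Page exists and turn <link> into a blue or red <a href="...">, while a DocBook pass could map it to its own linking elements; the base parser never needs to touch the database.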
-- brion vibber (brion @ pobox.com)