Brion Vibber wrote:
Our present "parser" is a hack with a series of regexps and other horrors, whose steps often stomp on each other and produce hard-to-fix errors. It's not something to be emulated; rather it is our greatest shame. Currently we cannot guarantee that the XHTML output will be well-formed, so changing it to a custom XML format would be a waste of time, as the result would not be transformable.
But still, a parser written in PHP is necessary, albeit a better one.
A character-by-character parser that can go from the beginning to the end and churn something out that's guaranteed to be well-formed should be less error-prone and easier to maintain. Whether flex/bison is the best route I cannot say, but it's worth exploring.
A proof-of-concept implementation might be a good thing to have around. But if I may, I can't see how, for instance, a simple flex/bison parser could adequately parse a set of varying extension languages, like the one used in <math> tags, into valid XML (in this case MathML, I guess).
The parser would have to be modular, with each parser module translating one language. Well, this sparks some ideas.
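To make the modular idea a bit more concrete, here is a rough sketch in Python (not PHP, and not a proposal for the real code). Everything in it is an assumption of mine: the `EXTENSIONS` registry, the `extension` decorator, the `<article>` root element, and the `parse_math` module, which only wraps the raw text instead of producing real MathML. The point is only that the base scanner hands the body of a registered tag to its module and builds a tree, so the serialized output is well-formed by construction.

```python
import xml.etree.ElementTree as ET

# Hypothetical registry: extension tag name -> function that turns the
# raw tag body into an XML element.
EXTENSIONS = {}

def extension(tag):
    """Register a parser module for one extension tag."""
    def register(func):
        EXTENSIONS[tag] = func
        return func
    return register

@extension("math")
def parse_math(body):
    # Placeholder module: a real one would translate the TeX-like
    # source into MathML instead of just wrapping it.
    node = ET.Element("math")
    node.text = body.strip()
    return node

def append_text(root, chunk):
    """Attach plain text after the last child, or as leading text."""
    if not chunk:
        return
    if len(root):
        root[-1].tail = (root[-1].tail or "") + chunk
    else:
        root.text = (root.text or "") + chunk

def parse(text):
    """Scan left to right, delegating registered tags to their modules."""
    root = ET.Element("article")
    flushed = 0  # everything before this index is already in the tree
    scan = 0     # where to look for the next '<'
    while True:
        start = text.find("<", scan)
        if start == -1:
            break
        end = text.find(">", start)
        if end == -1:
            break
        tag = text[start + 1:end]
        close = "</" + tag + ">"
        close_at = text.find(close, end + 1)
        if tag in EXTENSIONS and close_at != -1:
            append_text(root, text[flushed:start])
            root.append(EXTENSIONS[tag](text[end + 1:close_at]))
            flushed = scan = close_at + len(close)
        else:
            scan = start + 1  # not an extension tag, treat '<' as text
    append_text(root, text[flushed:])
    return root

print(ET.tostring(parse("a < b and <math>x^2</math> done"), encoding="unicode"))
```

A stray `<` that does not open a registered tag ends up as escaped text, because the serializer, not the scanner, is responsible for escaping.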
Having this parser output an internal XML format instead of XHTML directly means a) we can maintain semantic information that would be lost in HTML and b) we can keep the base _parser_ separate from the code that does things like check for page existence, format the URLs for local links, and perhaps handle template transclusions. This allows transformation to other formats (XHTML, DocBook?) with less crap than e.g. trying to rewrite all the HTML into DocBook.
I completely agree. My question was about the best way of doing that parsing.
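For what it's worth, this is how I picture the separation, again as a small Python sketch under my own assumptions. The internal format here (an `<article>` root with `<link target="...">` elements), the `render_xhtml` function, and the `page_exists` / `url_for` callbacks are all made up for illustration; the "new" class for links to missing pages is just a stand-in. The base parser would only emit the semantic `<link>`, and a later stage would decide what it looks like in XHTML.

```python
import xml.etree.ElementTree as ET

def render_xhtml(internal_xml, page_exists, url_for):
    """Turn the (hypothetical) internal XML into an XHTML fragment.

    page_exists and url_for are supplied by the caller, so the parser
    that produced internal_xml never needs to touch the database.
    """
    src = ET.fromstring(internal_xml)
    out = ET.Element("div")
    out.text = src.text
    for child in src:
        if child.tag == "link":
            target = child.get("target")
            a = ET.SubElement(out, "a", href=url_for(target))
            if not page_exists(target):
                a.set("class", "new")  # assumed marker for missing pages
            a.text = child.text or target
            a.tail = child.tail
        else:
            out.append(child)  # pass anything else through unchanged
    return ET.tostring(out, encoding="unicode")

doc = ('<article>See <link target="Main Page">the main page</link>'
       ' or <link target="Nowhere"/>.</article>')
html = render_xhtml(
    doc,
    page_exists=lambda title: title == "Main Page",
    url_for=lambda title: "/wiki/" + title.replace(" ", "_"),
)
print(html)
```

A DocBook renderer would be a second function over the same internal tree, which is the part that rewriting HTML into DocBook cannot give us.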
Cheers, Pedro.