Brion Vibber wrote:
Our present "parser" is a hack with a series of regexps and other
horrors, whose steps often stomp on each other and produce hard-to-fix
errors. It's not something to be emulated; rather it is our greatest
shame. Currently we cannot guarantee that XHTML output will be
well-formed, so changing it to a custom XML format would be a waste of
time, as it would not be transformable.
But, still, a parser written in PHP is necessary, albeit a better one.
A character-by-character parser that can go from the beginning to the
end and churn something out that's guaranteed to be well-formed should
be less error-prone and easier to maintain. Whether flex/bison is the
best route I cannot say, but it's worth exploring.
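
To make the single-pass idea concrete, here is a minimal sketch in Python (not PHP, and not the existing parser). It assumes a hypothetical two-marker subset of the markup, '' for italics and ''' for bold, and the names MARKERS and to_xml are invented for illustration. It walks the input once, keeps a stack of open elements, and closes whatever is still open at the end, so the output is well-formed even when the markers in the input are mismatched.

```python
from xml.sax.saxutils import escape

# Hypothetical markup subset: ''' toggles <b>, '' toggles <i>.
# The longer marker is listed first so it wins when both match.
MARKERS = [("'''", "b"), ("''", "i")]

def to_xml(text):
    out = []    # output fragments, joined at the end
    stack = []  # names of currently open elements, innermost last
    i = 0
    while i < len(text):
        for marker, tag in MARKERS:
            if not text.startswith(marker, i):
                continue
            if tag in stack:
                # Close the elements opened inside `tag`, close `tag`,
                # then reopen the inner ones so nesting stays proper.
                reopened = []
                while stack[-1] != tag:
                    inner = stack.pop()
                    out.append(f"</{inner}>")
                    reopened.append(inner)
                stack.pop()
                out.append(f"</{tag}>")
                for inner in reversed(reopened):
                    out.append(f"<{inner}>")
                    stack.append(inner)
            else:
                out.append(f"<{tag}>")
                stack.append(tag)
            i += len(marker)
            break
        else:
            # Ordinary character: escape &, < and > before emitting it.
            out.append(escape(text[i]))
            i += 1
    # Close anything the input left open.
    while stack:
        out.append(f"</{stack.pop()}>")
    return "<p>" + "".join(out) + "</p>"

print(to_xml("a ''b '''c'' d"))
# -> <p>a <i>b <b>c</b></i><b> d</b></p>
```

The stack is what gives the guarantee: every opening tag that gets emitted is recorded, and nothing is closed out of order.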
A proof-of-concept implementation might be a good thing to have around.
But if I may, I can't see how, for instance, a simple flex/bison parser
could adequately parse a set of varying extension languages, like the
one used in <math> tags, into valid XML (in this case MathML, I guess).
The parser would have to be modular, with each parser module
translating one language. Well, this sparks some ideas.
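
One of those ideas, as a rough Python sketch under stated assumptions: the base parser only recognises where an extension block starts and ends, and hands the body to whichever module is registered for that tag. The registry HANDLERS, the extension decorator and the <ext> wrapper element are all hypothetical, and math_module is a placeholder that merely escapes its input where a real module would emit MathML.

```python
from xml.sax.saxutils import escape

# Hypothetical registry: extension tag name -> function that turns the
# tag body into an XML fragment.
HANDLERS = {}

def extension(name):
    """Register a parser module for the extension tag <name>."""
    def register(func):
        HANDLERS[name] = func
        return func
    return register

@extension("math")
def math_module(body):
    # Placeholder only: a real module would parse the body and build
    # MathML. Here the body is escaped and wrapped.
    return f'<ext name="math">{escape(body)}</ext>'

def parse(text):
    out = []
    i = 0
    while i < len(text):
        if text[i] == "<":
            end_name = text.find(">", i)
            name = text[i + 1:end_name] if end_name != -1 else ""
            close = f"</{name}>"
            # Only look for a closing tag when a module is registered.
            end_body = text.find(close, end_name + 1) if name in HANDLERS else -1
            if end_body != -1:
                # Delegate the raw body to the module for this tag.
                out.append(HANDLERS[name](text[end_name + 1:end_body]))
                i = end_body + len(close)
                continue
        # Not an extension block: emit the character, escaped.
        out.append(escape(text[i]))
        i += 1
    return "<doc>" + "".join(out) + "</doc>"

print(parse("E: <math>a<b</math>!"))
# -> <doc>E: <ext name="math">a&lt;b</ext>!</doc>
```

The base parser never looks inside the extension body, so a new language only needs a new registered module.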
Having this parser output an internal XML format instead of XHTML
directly means a) we can maintain semantic information that would be
lost in HTML and b) we can keep the base _parser_ separate from the code
that does things like check for page existence, format the URLs for
local links, and perhaps template transclusions. This allows
transformation to other formats (XHTML, DocBook?) with less crap than e.g.
trying to rewrite all the HTML into DocBook.
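
A small Python sketch of that separation, with everything in it assumed for illustration: the internal format (a <link target="..."> element inside <doc>), the function render_xhtml, and the two callbacks page_exists and url_for are hypothetical. The parser's output keeps the semantic link target, and only the rendering step decides about URLs and missing pages.

```python
import xml.etree.ElementTree as ET
from urllib.parse import quote

# Hypothetical internal format as a base parser might emit it.
INTERNAL = (
    '<doc><p>See <link target="Main Page">the main page</link> '
    'or <link target="No Such Page"/>.</p></doc>'
)

def render_xhtml(internal_xml, page_exists, url_for):
    """Turn the internal tree into an XHTML fragment.

    page_exists and url_for are supplied by the caller, so the parser
    itself never needs to know about the database or the URL layout.
    """
    root = ET.fromstring(internal_xml)
    for link in root.iter("link"):
        target = link.attrib.pop("target")
        link.tag = "a"
        link.set("href", url_for(target))
        if not page_exists(target):
            link.set("class", "new")  # mark links to missing pages
        if link.text is None and len(link) == 0:
            link.text = target        # bare link: show the page title
    root.tag = "div"
    return ET.tostring(root, encoding="unicode")

existing = {"Main Page"}
html = render_xhtml(
    INTERNAL,
    page_exists=lambda title: title in existing,
    url_for=lambda title: "/wiki/" + quote(title.replace(" ", "_")),
)
print(html)
```

A DocBook renderer would be a second function over the same internal tree, with no HTML to undo.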
I completely agree. My question was about the best way of doing that
parsing.
Cheers,
Pedro.