On Sun, 2006-03-26 at 03:08 +0000, Ævar Arnfjörð wrote:
This fixes the "Bug 2702: Mismatched
<i>, <b> and <a> tags are
invalid" test case, but it's not really an improvement. The test case
was supposed to demonstrate that we don't balance tags, and this
doesn't fix that; it merely hacks around very specific cases with
regular expressions, which break as soon as you insert more tags that
a parser balancing tags properly would have handled.
I'm all for fixing the parser, but it's not an improvement to make the
parser test cases we have pass by basically writing a hack into the
parser so that just that one test passes, rather than fixing the core issue.
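For context, the "proper" fix the test case is after is straightforward in principle: track open tags on a stack and emit closing tags in LIFO order, rather than pattern-matching specific cases. The sketch below is a hypothetical toy illustration in Python (function name and tag list are my own; MediaWiki's actual sanitizer code is considerably more involved):

```python
import re

def balance_tags(html, balanced=('i', 'b', 'a')):
    """Close mismatched or unclosed inline tags in LIFO order.

    Toy stack-based balancer, not MediaWiki code: tags listed in
    `balanced` are tracked; anything else passes through untouched.
    """
    out, stack = [], []
    # Split into tag tokens and text runs; the capture group keeps the tags.
    for token in re.split(r'(</?\w+[^>]*>)', html):
        m = re.match(r'<(/?)(\w+)', token)
        if not m or m.group(2) not in balanced:
            out.append(token)              # plain text, or a tag we don't track
        elif m.group(1) == '':             # opening tag: remember it
            stack.append(m.group(2))
            out.append(token)
        elif m.group(2) in stack:          # closing tag with a matching open
            # Close everything opened after it, then the tag itself (LIFO).
            while True:
                top = stack.pop()
                out.append('</%s>' % top)
                if top == m.group(2):
                    break
        # else: stray closing tag with no matching open -- drop it
    out.extend('</%s>' % t for t in reversed(stack))  # close leftovers at EOF
    return ''.join(out)
```

So `<i>foo<b>bar</i>baz` becomes `<i>foo<b>bar</b></i>baz`: the unmatched `</i>` forces the inner `<b>` closed first. A regex-per-case approach has to anticipate every such nesting individually; the stack handles them all uniformly.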
i'm the last one to disagree with you about the core issue.
I did start an (unfinished) attempt to write a Bison-based parser, and i
added Tidy support to MediaWiki to get at least close to valid XHTML.
Timwi and others got further with the flexbisonparse module in CVS.
But still, as it stands, the parser in use is essentially a sophisticated
pile of regular expressions.
While i would welcome all effort being directed towards a proper
solution of the core issue, this is simply not happening right now, for
various reasons. Shifting the focus of MediaWiki development from the
current parser to a better solution is difficult at best; i doubt
anything short of doing the full implementation, or funding it, will do. I
would love to be wrong on this, though, and i would really like to work
on such a parser.
My task at hand was to fix bugs in the current parser, not to work on a
replacement for it. The change above makes the (X)HTML that MediaWiki
generates slightly less invalid, so i consider it an improvement overall.
Has anyone considered using a packrat parser for implementing MediaWiki
parsing? Packrat parsers use simple grammar-driven top-down recursion,
resolve ambiguity naturally through ordered choice, and use memoization
to prevent the combinatorial explosion that backtracking would otherwise
cause. They run quite fast, are easy to write, and are much easier to
understand and hack on than most compiler-compiler-style parsers.
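To make the idea concrete, here is a minimal packrat parser in Python for a toy arithmetic grammar (my own illustrative example, nothing to do with wikitext): ordinary recursive descent, one function per rule, with each (rule, position) result memoized so that even with backtracking every position is parsed at most once, giving linear time.

```python
from functools import lru_cache

def parse(text):
    """Evaluate an integer expression with a memoized (packrat) PEG parser.

    Grammar (PEG ordered choice, left to right):
        Expr   <- Term (('+' / '-') Term)*
        Term   <- Factor (('*' / '/') Factor)*
        Factor <- Number / '(' Expr ')'
    Each rule returns (value, next_pos) on success, None on failure.
    """

    @lru_cache(maxsize=None)            # the "packrat" part: memoize per position
    def number(pos):
        end = pos
        while end < len(text) and text[end].isdigit():
            end += 1
        if end == pos:
            return None                 # no digits here: rule fails
        return int(text[pos:end]), end

    @lru_cache(maxsize=None)
    def factor(pos):
        r = number(pos)                 # first alternative: Number
        if r:
            return r
        if pos < len(text) and text[pos] == '(':
            r = expr(pos + 1)           # second alternative: '(' Expr ')'
            if r and r[1] < len(text) and text[r[1]] == ')':
                return r[0], r[1] + 1
        return None

    @lru_cache(maxsize=None)
    def term(pos):
        r = factor(pos)
        if not r:
            return None
        value, pos = r
        while pos < len(text) and text[pos] in '*/':
            op = text[pos]
            r = factor(pos + 1)
            if not r:
                break                   # trailing operator: stop, leave it unconsumed
            rhs, pos = r
            value = value * rhs if op == '*' else value // rhs
        return value, pos

    @lru_cache(maxsize=None)
    def expr(pos):
        r = term(pos)
        if not r:
            return None
        value, pos = r
        while pos < len(text) and text[pos] in '+-':
            op = text[pos]
            r = term(pos + 1)
            if not r:
                break
            rhs, pos = r
            value = value + rhs if op == '+' else value - rhs
        return value, pos

    r = expr(0)
    if not r or r[1] != len(text):
        raise ValueError("parse error in %r" % text)
    return r[0]
```

For example, `parse("2*(3+4)-5")` yields 9. The same shape scales to a real grammar: add a function per rule, keep the memoization, and the ordered-choice alternatives (try `Number` first, then the parenthesized form) replace the ambiguity handling a conventional LR parser would need.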