Gabriel Wicke wrote:
On Sun, 2006-03-26 at 03:08 +0000, Ævar Arnfjörð
> This fixes the "Bug 2702: Mismatched <i>, <b> and <a> tags
> invalid" test case but it's not really an improvement. The test case
> was supposed to demonstrate that we don't balance tags, which this
> doesn't fix, it merely hacks around very specific cases with regular
> expressions which fail if you insert more tags which would be fixed in
> a parser that balanced tags properly.
for fixing the parser, but it's not an improvement to make
the parser test cases we have pass by basically writing a hack in the
parser to make just that test pass rather than fixing the core issue.
I'm the last one to disagree with you about the core issue.
I did start an (unfinished) attempt to write a Bison-based parser, and I
added tidy support to MediaWiki to get at least close to XHTML.
Timwi and others got further with the flexbisonparse module in CVS.
I'm not joking (and not trying to spam the mailing list yet again with
my stuff;-) when I say that my PHP-based wiki2xml parser/converter is
probably the only working alternative out there. Note that this is not
one of my many abandoned, half-finished toys; this one works, on the
whole syntax, on real pages.
Among other nice habits, it nests HTML tags correctly (or converts them
to plain text, which makes it easy to see the error). Same of course for
wiki markup. It might be a little more strict than the current MediaWiki
parser; that might be a good thing, actually.
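To illustrate the nesting behaviour described above, here is a minimal Python sketch of stack-based tag balancing. It is not wiki2xml's actual code (which is PHP and avoids regexps). The function name balance_tags, the ALLOWED set, and the attribute-free tag pattern are all assumptions made for the example. A closing tag that matches an open tag closes everything opened after it; a closing tag with no matching open tag is emitted as plain text, so the error stays visible.

```python
import re
from html import escape

# Hypothetical whitelist; a real parser would allow many more tags.
ALLOWED = {"i", "b", "a"}
# Simplified tokenizer: tags without attributes only (an assumption).
TAG_RE = re.compile(r"<(/?)([a-zA-Z]+)>")


def balance_tags(text):
    """Return text with allowed tags properly nested.

    Stray closing tags and unknown tags are escaped to plain text.
    """
    out = []
    stack = []  # names of currently open tags, innermost last
    pos = 0
    for m in TAG_RE.finditer(text):
        out.append(escape(text[pos:m.start()], quote=False))
        pos = m.end()
        closing = m.group(1) == "/"
        name = m.group(2).lower()
        if name not in ALLOWED:
            out.append(escape(m.group(0), quote=False))
        elif not closing:
            stack.append(name)
            out.append(f"<{name}>")
        elif name in stack:
            # Close everything opened after the matching tag, then the tag.
            while True:
                top = stack.pop()
                out.append(f"</{top}>")
                if top == name:
                    break
        else:
            # No matching open tag: show the stray tag as plain text.
            out.append(escape(m.group(0), quote=False))
    out.append(escape(text[pos:], quote=False))
    # Close anything still open at the end of the input.
    while stack:
        out.append(f"</{stack.pop()}>")
    return "".join(out)


print(balance_tags("<i>a<b>b</i>c</b>"))
```

With the mismatched input above, the inner <b> is closed before </i>, and the leftover </b> comes out as escaped text instead of breaking the markup.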
It is also surprisingly fast for its approach (which is similar to the
code structure flexbisonparse generates as C code). It converts
[[en:Biology]], including all templates, to XML in less than 0.5 seconds
(not counting the time it takes to load the source texts via web). I am
certain these times can be reduced by further tweaking. However, an
additional step to generate XHTML from XML has to be added (which should
be quite fast IMHO).
During the XML generation, it can collect all links in the page, which
can then be looked up with the usual single query before converting it
to XHTML, making the current replacement hell obsolete. Actually, it
doesn't use *any* regexps, except to remove HTML comments.
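The collect-then-look-up idea above can be sketched roughly as follows, in Python for brevity. This is only an illustration under stated assumptions: collect_links, lookup_existing and classify_links are hypothetical names, the set existing_pages stands in for the single database query, and the scanner handles only flat [[target|label]] links (no nesting, no templates).

```python
def collect_links(wikitext):
    """Return link targets in order of first appearance, without duplicates.

    Plain string scanning, no regexps; nested links are not handled.
    """
    targets = []
    seen = set()
    pos = 0
    while True:
        start = wikitext.find("[[", pos)
        if start == -1:
            break
        end = wikitext.find("]]", start + 2)
        if end == -1:
            break
        # Everything before the first "|" is the target; the rest is the label.
        target = wikitext[start + 2:end].split("|", 1)[0].strip()
        if target and target not in seen:
            seen.add(target)
            targets.append(target)
        pos = end + 2
    return targets


def lookup_existing(targets, existing_pages):
    """Stand-in for one batched query: which of the targets exist?"""
    return {t for t in targets if t in existing_pages}


def classify_links(wikitext, existing_pages):
    """Map each link target to True (page exists) or False (missing page)."""
    targets = collect_links(wikitext)
    found = lookup_existing(targets, existing_pages)  # the only lookup
    return {t: t in found for t in targets}


text = ("See [[Biology]] and [[Cell (biology)|cells]], then [[Biology]] "
        "again, and [[Nonexistent page]].")
print(classify_links(text, {"Biology", "Cell (biology)"}))
```

Because all targets are known before any output is produced, the output step can pick the right link style directly instead of inserting placeholders and replacing them afterwards.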
While there are likely lots of new, fun bugs to discover, IMHO they can
be squashed far more easily than in the current parser, due to its
cleaner structure (hey, don't laugh! I can write structured code if I
really want to!:-)
I will go ahead and add an XHTML export function to the wiki2xml code.
Anyone interested in outfitting MediaWiki with a decent almost-parser is
invited to help ;-) especially with a potential integration into