2010-12-29 08:33, Andrew Dunbar wrote:
I've thought a lot about this too. It certainly is not any type of standard grammar. But on the other hand it is a pretty common kind of nonstandard grammar. I call it a "recursive text replacement grammar".
Perhaps this type of grammar has some useful characteristics we can discover and document. It may be possible to follow the code flow and document each text replacement in sequence as a kind of parser spec rather than trying and failing again to shoehorn it into a standard LALR grammar.
If it is possible to extract such a spec, it would then be possible to implement it in other languages.
Some research may even find that it is possible to transform such a grammar deterministically into an LALR grammar...
But even if not, I'm certain it would demystify what happens in the parser, so that problems and edge cases would be easier to locate.
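To make the "recursive text replacement grammar" idea concrete, here is a minimal, hypothetical sketch in Python of a parser structured as a fixed sequence of substitution passes, each applied until the text stops changing. The rules, their order, and the output markup are invented for illustration; they are not the actual MediaWiki replacement phases:

import re

# Hypothetical replacement passes, applied in a fixed order, each repeated
# until it reaches a fixed point: the essence of a "recursive text
# replacement grammar".  These rules are illustrative only.
PASSES = [
    (re.compile(r"'''(.+?)'''"), r"<b>\1</b>"),                            # bold
    (re.compile(r"''(.+?)''"), r"<i>\1</i>"),                              # italics
    (re.compile(r"\[\[([^|\]]+)\|([^\]]+)\]\]"), r'<a href="\1">\2</a>'),  # piped link
    (re.compile(r"\[\[([^\]]+)\]\]"), r'<a href="\1">\1</a>'),             # plain link
]

def transform(text):
    for pattern, replacement in PASSES:
        while True:
            new_text = pattern.sub(replacement, text)
            if new_text == text:   # this pass has reached its fixed point
                break
            text = new_text
    return text

print(transform("''italic'' and [[Page|a '''bold''' link]]"))
# -> <i>italic</i> and <a href="Page">a <b>bold</b> link</a>

Documenting each such pass (its pattern, replacement, position in the sequence, and when it is re-applied) would be exactly the kind of parser spec described above.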
From my experience of implementing a wikitext parser, I would say that
it might be possible to transform wikitext into a token stream that can be parsed with an LALR parser. My implementation (http://svn.wikimedia.org/svnroot/mediawiki/trunk/parsers/libmwparser) uses Antlr (which is an LL parser generator) and relies on context-sensitive parsing (Antlr's semantic predicates) only for apostrophes (bold and italics), and even that might be solvable in a different way. The rest of the complex cases are handled by the lexical analyser, which produces a well-behaved token stream that can be parsed relatively straightforwardly.
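For what it's worth, here is a toy sketch of that division of labour (Python, with invented token names and rules, nothing taken from libmwparser): the lexer keeps the book-keeping about what is open, so it only emits a close token when a matching open token is pending, and the grammar on top of the well-behaved stream can stay simple and context-free:

import re

# Lexer: splits the input into candidate tokens and downgrades anything it
# cannot pair up to plain text, so the token stream is always well behaved.
TOKEN_RE = re.compile(r"\[\[|\]\]|\||[^\[\]|]+|[\[\]]")

def lex(text):
    open_links = 0
    for match in TOKEN_RE.finditer(text):
        tok = match.group(0)
        if tok == "[[":
            open_links += 1
            yield ("LINK_OPEN", tok)
        elif tok == "]]" and open_links:
            open_links -= 1
            yield ("LINK_CLOSE", tok)
        elif tok == "|" and open_links:
            yield ("PIPE", tok)
        else:
            # unmatched "]]", stray "|" etc. become ordinary text
            yield ("TEXT", tok)

def parse(tokens):
    # Toy grammar over the token stream:
    #   page := (TEXT | link)*
    #   link := LINK_OPEN TEXT (PIPE TEXT)? LINK_CLOSE
    tokens = list(tokens)
    out = []
    i = 0
    while i < len(tokens):
        kind, value = tokens[i]
        if kind == "LINK_OPEN":
            j = i + 1
            parts = []
            while j < len(tokens) and tokens[j][0] != "LINK_CLOSE":
                if tokens[j][0] == "TEXT":
                    parts.append(tokens[j][1])
                j += 1
            if parts:
                out.append('<a href="%s">%s</a>' % (parts[0], parts[-1]))
            i = j + 1   # an unclosed link is simply dropped in this toy version
        else:
            out.append(value)
            i += 1
    return "".join(out)

print(parse(lex("See [[Main Page|the main page]] and a stray ]] here.")))
# -> See <a href="Main Page">the main page</a> and a stray ]] here.

The real thing obviously has to handle far more constructs, but the principle is the same: the lexer makes the pairing decisions, so the grammar side stays close to something a standard parser generator can accept.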
My implementation is not 100% compatible, but I don't think a 100% compatible parser is desirable anyway, since the most exotic border cases would probably be characterized as "bugs" (e.g. [[Link|<table class="]]">). Still, I think the basic idea can be used to produce a sufficiently compatible parser.
Best Regards,
/Andreas