On 3 January 2011 21:54, Andreas Jonsson <andreas.jonsson@kreablo.se> wrote:
On 2010-12-29 08:33, Andrew Dunbar wrote:
I've thought a lot about this too. It certainly is not any type of standard grammar. But on the other hand it is a pretty common kind of nonstandard grammar. I call it a "recursive text replacement grammar".
Perhaps this type of grammar has some useful characteristics we can discover and document. It may be possible to follow the code flow and document each text replacement in sequence as a kind of parser spec rather than trying and failing again to shoehorn it into a standard LALR grammar.
If it is possible to extract such a spec, it would then be possible to implement it in other languages.
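For illustration, here is a minimal sketch in Python of what such an ordered, documented sequence of text replacements might look like. The three rules are made up for the example and are nothing like the real MediaWiki pass list, and the real parser also recurses into template expansion, which is omitted here:

    import re

    # Hypothetical rule set for illustration only; the real parser has many
    # more passes, in a specific order, and recurses into template expansion.
    REPLACEMENTS = [
        # (name, pattern, substitution) -- each pass documented in sequence
        ("bold",    re.compile(r"'''(.+?)'''"),                 r"<b>\1</b>"),
        ("italics", re.compile(r"''(.+?)''"),                   r"<i>\1</i>"),
        ("link",    re.compile(r"\[\[([^|\]]+)\|([^\]]+)\]\]"), r'<a href="\1">\2</a>'),
    ]

    def parse(text):
        """Apply each replacement pass over the whole text, in order."""
        for name, pattern, sub in REPLACEMENTS:
            text = pattern.sub(sub, text)
        return text

    print(parse("''hello'' [[Page|world]]"))
    # <i>hello</i> <a href="Page">world</a>

The point being that the spec falls out of listing the passes and their order, not out of a grammar formalism.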
Some research may even find that it is possible to transform such a grammar deterministically into an LALR grammar...
But even if not, I'm certain it would demystify what happens in the parser so that problems and edge cases would be easier to locate.
From my experience of implementing a wikitext parser, I would say that it might be possible to transform wikitext into a token stream that can be parsed with an LALR parser. My implementation (http://svn.wikimedia.org/svnroot/mediawiki/trunk/parsers/libmwparser) uses Antlr (which is an LL parser generator) and only relies on context-sensitive parsing (Antlr's semantic predicates) for parsing apostrophes (bold and italics), and this might be possible to solve in a different way. The rest of the complex cases are handled by the lexical analyser, which produces a well-behaved token stream that can then be parsed relatively straightforwardly.
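To illustrate the kind of thing the lexer can resolve up front, here is a rough sketch in Python (not the actual Antlr grammar; the token names are invented, and the real apostrophe handling also does line-wide balancing that is omitted here) of turning ambiguous apostrophe runs into an unambiguous token stream:

    import re

    def tokenize(text):
        """Resolve apostrophe runs lexically so the parser sees plain tokens."""
        tokens = []
        pos = 0
        for m in re.finditer(r"'{2,}", text):
            if m.start() > pos:
                tokens.append(("TEXT", text[pos:m.start()]))
            run = len(m.group())
            # Wikitext convention: 2 = italics, 3 = bold, 5 = both;
            # extra apostrophes are spilled back out as literal text.
            if run == 2:
                tokens.append(("ITALICS", "''"))
            elif run == 3:
                tokens.append(("BOLD", "'''"))
            elif run >= 5:
                tokens.append(("TEXT", "'" * (run - 5)))
                tokens.append(("BOLD_ITALICS", "'''''"))
            else:  # run == 4: one apostrophe is literal text
                tokens.append(("TEXT", "'"))
                tokens.append(("BOLD", "'''"))
            pos = m.end()
        if pos < len(text):
            tokens.append(("TEXT", text[pos:]))
        return tokens

    print(tokenize("''italic'' and '''bold'''"))

Once the run lengths are resolved at the lexical level, the parser downstream never has to ask how many apostrophes it is looking at.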
My implementation is not 100% compatible, but I think that a 100% compatible parser is not desirable anyway, since the most exotic edge cases would probably be characterized as "bugs" (e.g. [[Link|<table class="]]">). But I think that the basic idea can be used to produce a sufficiently compatible parser.
In that case, what is needed is to hook your parser into our current code and get it to create output, if you have not done that already. Then you will want to run the existing parser tests on it. Then you will want to run both parsers over a large sample of existing Wikipedia articles (make sure you use the same revisions for both parsers!) and run the outputs through diff; a sketch of such a harness follows. Then we'll have a decent idea of whether there are any edge cases you didn't spot, or whether any of them are exploited in template magic.
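Something along these lines would do for the bulk comparison. The commands and paths here are placeholders; substitute however each parser is actually invoked (e.g. MediaWiki's maintenance/parse.php for the existing one, and some CLI wrapper around libmwparser for the new one):

    import difflib
    import subprocess
    from pathlib import Path

    def render(cmd, wikitext):
        """Feed one revision's wikitext to a parser on stdin, capture its output."""
        result = subprocess.run(cmd, input=wikitext, capture_output=True,
                                text=True, check=True)
        return result.stdout

    # Placeholder commands: substitute the real invocations for both parsers.
    OLD_PARSER = ["php", "maintenance/parse.php"]   # existing MediaWiki parser
    NEW_PARSER = ["./libmwparser-cli"]              # hypothetical CLI wrapper

    for page in Path("sample_revisions").glob("*.wiki"):
        src = page.read_text()
        diff = list(difflib.unified_diff(
            render(OLD_PARSER, src).splitlines(),
            render(NEW_PARSER, src).splitlines(),
            lineterm=""))
        if diff:
            print("== %s: %d differing lines ==" % (page.name, len(diff)))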
Let us know the results!
Andrew Dunbar (hippietrail)
Best Regards,
/Andreas