Hi Mingli
I guess everyone gave up on the dream of being able to define the current syntax in any sane, well-defined form ;)
I tried to build a parser similar to flexbisonparse a while ago, using flex and bison to create an XML parse tree. Of course, I failed miserably after two weeks of work and went back to the Perl regex monstrosity we use at the company. But I did find out the following things which may be useful for any future efforts:
I believe it's wrong to attempt to create a single parser for MediaWiki syntax (as flexbisonparse attempted). A better and much simpler way is to define a separate formal grammar for each step of the parsing. This way you can get around the problem that arises when, for example, an XML-like tag is constructed from several different templates. My attempt included separate flex/bison parsers for:
<noinclude>, <includeonly>, ... parts
template transclusion (e.g. the {{{ and {{ constructs)
text formatting
possibly more steps for tables, etc., but I didn't get that far.
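To make the staged idea concrete, here is a minimal sketch in Python. It is not my flex/bison code: regular expressions stand in for the per-stage grammars, template parameters are ignored, and all function names (strip_inclusion_tags, expand_templates, format_text, render) are made up for illustration. It only shows why running the stages in sequence copes with markup that is assembled from several templates.

```python
import re

def strip_inclusion_tags(text, transcluding=False):
    """Stage 1: resolve <noinclude>/<includeonly> sections."""
    if transcluding:
        # Seen through a template call: drop noinclude parts, unwrap includeonly.
        text = re.sub(r"<noinclude>.*?</noinclude>", "", text, flags=re.S)
        return re.sub(r"</?includeonly>", "", text)
    # Viewed directly: drop includeonly parts, unwrap noinclude.
    text = re.sub(r"<includeonly>.*?</includeonly>", "", text, flags=re.S)
    return re.sub(r"</?noinclude>", "", text)

def expand_templates(text, templates, depth=0):
    """Stage 2: replace {{name}} with the template body (no parameters)."""
    if depth > 10:  # arbitrary guard against recursive templates
        return text

    def repl(match):
        name = match.group(1).strip()
        if name not in templates:
            return match.group(0)  # leave unknown calls untouched
        body = strip_inclusion_tags(templates[name], transcluding=True)
        return expand_templates(body, templates, depth + 1)

    return re.sub(r"\{\{([^{}|]+)\}\}", repl, text)

def format_text(text):
    """Stage 3: bold and italics, simplest balanced cases only."""
    text = re.sub(r"'''(.+?)'''", r"<b>\1</b>", text)
    return re.sub(r"''(.+?)''", r"<i>\1</i>", text)

def render(text, templates):
    """Run the stages in order; each one sees the previous stage's output."""
    text = strip_inclusion_tags(text)
    text = expand_templates(text, templates)
    return format_text(text)

# The bold markers come from two different templates, so a single-pass
# grammar never sees them as a pair. After stage 2 they are ordinary text.
templates = {"open": "'''", "close": "'''<noinclude>docs</noinclude>"}
print(render("{{open}}bold{{close}} and ''italic''", templates))
```

The design point is only that stage 3 never has to know templates exist. This sketch does none of the graceful degradation the real parser does.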
The biggest problem in defining these is graceful degradation on broken input. It's not that hard to get the parser to work in simple, well-defined cases. But if you want to get anywhere near the way the current parser degrades on ambiguous input, the parser definitions start to grow out of hand and parsing speed drops badly. You're just trying to cram context into a context-free grammar.
From my observations, I believe the only way any formal grammar will replace the current PHP parser is if the MediaWiki team is prepared to change its current philosophy of desperately trying to make sense of any broken string of characters the user provides, i.e. if MediaWiki could throw a syntax error on invalid input and/or significantly reduce the number of valid constructs (all the horrible combinations of bold/italics markup come to mind).
Given my understanding of the project, I find this extremely unlikely. But then I'm not a MediaWiki developer, so I might be completely wrong here.
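For illustration, this is roughly what the strict alternative could look like for bold/italics alone. It is a hypothetical Python sketch, not anything MediaWiki does or plans: the names WikiSyntaxError and strict_format are invented, and the rules (only runs of exactly two or three apostrophes are valid, and they must nest properly) are an assumption chosen to keep the grammar small.

```python
import re

class WikiSyntaxError(ValueError):
    """Raised instead of guessing what broken markup was meant to be."""

# Only two quote runs are valid in this reduced syntax.
TAGS = {2: "i", 3: "b"}

def strict_format(line):
    out = []
    stack = []  # currently open tags, innermost last
    # Splitting on a capturing group keeps the quote runs as tokens.
    for token in re.split(r"('{2,})", line):
        if not token.startswith("''"):
            out.append(token)  # plain text, including lone apostrophes
            continue
        tag = TAGS.get(len(token))
        if tag is None:
            raise WikiSyntaxError(f"ambiguous quote run of length {len(token)}")
        if stack and stack[-1] == tag:
            stack.pop()
            out.append(f"</{tag}>")
        elif tag in stack:
            raise WikiSyntaxError(f"<{tag}> closed out of order")
        else:
            stack.append(tag)
            out.append(f"<{tag}>")
    if stack:
        raise WikiSyntaxError(f"unclosed <{stack[-1]}>")
    return "".join(out)

print(strict_format("it's ''a '''nested''' case''"))
```

Input such as a run of five apostrophes, an unclosed '', or overlapping bold and italics is rejected here, whereas the current parser tries to render something sensible for each of them. That difference is exactly the change in philosophy I mean.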
Best regards Tomaž Šolc
---
Tomaž Šolc, Research & Development
Zemanta Ltd, London, Ljubljana
www.zemanta.com
mail: tomaz@zemanta.com
blog: http://www.tablix.org/~avian/blog