Nick Jenkins wrote:
> > As one of the many people who've done so, I agree. :) The problem is that ~80% of wikimarkup is pretty straightforward to parse using standard methods, another 10-15% can be done without huge difficulty using known-but-less-standard methods, and the remaining 5% doesn't fit well at all into any of the normal models of lexing/parsing.
> > [...snip...]
> > -Mark
>
> Can I suggest giving some examples that you encountered from the 10-15% hard category, and from the 5% very hard category?
>
> I ask so that if anyone feels tempted to start defining the behaviour, we can gently suggest doing the harder stuff *first* (with examples), hopefully preventing a situation where we end up with multiple unfinished 80%-done definitions and no 100%-complete formal definition.
>
> All the best, Nick.
Just one example, probably from the 5% very hard category:
'''''hello''' hi'' vs. '''''hi'' hello'''
Rendered in HTML, the first reads <i><b>hello</b> hi</i>, and the second reads <b><i>hi</i> hello</b>. The problem is that the meaning of the first 5 quotes depends on the order in which the bold and italic regions close, which cannot be determined while scanning left-to-right.
Another example:
'''hello ''hi''' there''
MediaWiki renders this as <b>hello <i>hi</i></b><i> there</i>, handling the overlapping formatting by closing the italics before the bold and reopening them afterwards.
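
The trick visible in that output is close-and-reopen: when a tag closes out of order, close every tag opened after it, emit its close, then reopen the others. A minimal Python sketch of that idea (hypothetical names and token format, not MediaWiki's actual code):

# Rewrite overlapping inline ranges as properly nested HTML by
# closing and reopening tags. 'events' is a pre-tokenized stream;
# input is assumed balanced (every close has a matching open).
def interleave_to_nested(events):
    out, stack = [], []              # stack of currently open tags
    for kind, val in events:
        if kind == 'text':
            out.append(val)
        elif kind == 'open':
            stack.append(val)
            out.append('<%s>' % val)
        else:                        # close, possibly out of order
            reopen = []
            while stack and stack[-1] != val:
                t = stack.pop()      # close inner tags first...
                out.append('</%s>' % t)
                reopen.append(t)
            if stack:
                stack.pop()
                out.append('</%s>' % val)
            for t in reversed(reopen):
                stack.append(t)      # ...then reopen them outside
                out.append('<%s>' % t)
    return ''.join(out)

# '''hello ''hi''' there''  ->  <b>hello <i>hi</i></b><i> there</i>
print(interleave_to_nested([
    ('open', 'b'), ('text', 'hello '),
    ('open', 'i'), ('text', 'hi'),
    ('close', 'b'), ('text', ' there'),
    ('close', 'i'),
]))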
There are ways to deal with both of these... putting off the resolution until a later pass is the only way I know of that handles the first example, and it's a bit touchy. Manageable, but touchy.
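
To illustrate the deferred resolution, here's a much-simplified two-pass sketch in Python (hypothetical, not MediaWiki's algorithm; it handles only the five-quote ambiguity, ignores stray apostrophes, and would need the close-and-reopen trick above to cope with the second example). Pass one collects the quote runs without interpreting them; pass two, with the whole token stream in hand, orders the tags opened by ''''' according to whichever marker closes first.

import re

QUOTES = re.compile(r"'{5}|'{3}|'{2}")

def render(wikitext):
    # Pass 1: split into ('text', s) and unresolved ('quote', run) tokens.
    tokens, pos = [], 0
    for m in QUOTES.finditer(wikitext):
        if m.start() > pos:
            tokens.append(('text', wikitext[pos:m.start()]))
        tokens.append(('quote', m.group()))
        pos = m.end()
    if pos < len(wikitext):
        tokens.append(('text', wikitext[pos:]))

    # Pass 2: resolve each token now that the whole stream is known.
    out, open_tags = [], []
    for idx, (kind, val) in enumerate(tokens):
        if kind == 'text':
            out.append(val)
        elif val == "'''''":
            # Whichever marker closes first after this point must be
            # the inner (last-opened) tag; default order is arbitrary.
            later = [v for k, v in tokens[idx + 1:] if k == 'quote']
            inner = 'b' if later and later[0] == "'''" else 'i'
            outer = 'i' if inner == 'b' else 'b'
            for t in (outer, inner):
                open_tags.append(t)
                out.append('<%s>' % t)
        else:
            tag = 'b' if val == "'''" else 'i'
            if tag in open_tags:         # second occurrence: a closer
                open_tags.remove(tag)
                out.append('</%s>' % tag)
            else:                        # first occurrence: an opener
                open_tags.append(tag)
                out.append('<%s>' % tag)
    return ''.join(out)

print(render("'''''hello''' hi''"))   # <i><b>hello</b> hi</i>
print(render("'''''hi'' hello'''"))   # <b><i>hi</i> hello</b>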
If I have time this summer, I'm going to look at formalizing a parser again... see if I can make a start on hammering out a somewhat more formal structure for handling some of the tougher cases, after I've tackled OLPC's CrossMark.
- Eric Astor