Jared Williams wrote:
Just one example, probably from the 5% "very hard" category:
'''''hello''' hi'' vs. '''''hi'' hello'''
Rendered in HTML, the first reads <i><b>hello</b> hi</i>, and the second reads <b><i>hi</i> hello</b>. The problem is that the meaning of the first 5 quotes changes based on the order in which the bold and italic regions close - which is not determined while scanning left-to-right.
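To make the ambiguity concrete, here is a minimal Python sketch (not MediaWiki's actual code; the helper names `quote_runs` and `nesting_for_five` are made up for illustration). It splits each input into quote runs and text, and shows that the nesting of the leading five quotes can only be chosen after looking ahead to the next quote run:

```python
import re

def quote_runs(text):
    """Split wikitext into runs of apostrophes and runs of other characters."""
    return re.findall(r"'+|[^']+", text)

def nesting_for_five(tokens):
    """Given tokens that start with a 5-quote run, return (outer, inner) tags.

    Hypothetical helper: the choice depends on the *next* quote run,
    because whichever style closes first has to be the inner element.
    """
    if not tokens or tokens[0] != "'''''":
        raise ValueError("expected input starting with five quotes")
    for tok in tokens[1:]:
        if tok == "'''":
            return ("i", "b")  # bold closes first, so bold is inner
        if tok == "''":
            return ("b", "i")  # italic closes first, so italic is inner
    return None  # no closing run found; left undecided in this sketch

first = quote_runs("'''''hello''' hi''")
second = quote_runs("'''''hi'' hello'''")

# Both inputs start with the same token, yet need opposite nesting.
print(first[0] == second[0])     # True
print(nesting_for_five(first))   # ('i', 'b')
print(nesting_for_five(second))  # ('b', 'i')
```

The lookahead here is unbounded: the deciding run can be arbitrarily far from the opening quotes.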
Another example:
'''hello ''hi''' there''
MediaWiki renders this as <b>hello <i>hi</i></b><i> there</i>, properly handling overlapping formatting.
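One way to get that kind of output, sketched in Python under some loud assumptions: this is not MediaWiki's algorithm, `render` is a hypothetical function, only runs of exactly two or three quotes are handled, and text is not HTML-escaped. When a tag is closed while others are still open inside it, the inner ones are closed first and reopened afterwards:

```python
import re

# Assumption: only '' (italic) and ''' (bold) are treated as markup here.
TAG_FOR_RUN = {"''": "i", "'''": "b"}

def render(text):
    """Render bold/italic quote runs as properly nested HTML tags."""
    out = []
    open_tags = []  # stack of currently open tags, outermost first
    for tok in re.findall(r"'+|[^']+", text):
        tag = TAG_FOR_RUN.get(tok)
        if tag is None:
            out.append(tok)  # plain text or an unsupported run: emit as-is
        elif tag in open_tags:
            # Close everything from `tag` inward, then reopen the inner tags.
            idx = open_tags.index(tag)
            reopened = open_tags[idx + 1:]
            for t in reversed(open_tags[idx:]):
                out.append(f"</{t}>")
            del open_tags[idx:]
            for t in reopened:
                out.append(f"<{t}>")
                open_tags.append(t)
        else:
            out.append(f"<{tag}>")
            open_tags.append(tag)
    # Close anything the input left open.
    for t in reversed(open_tags):
        out.append(f"</{t}>")
    return "".join(out)

print(render("'''hello ''hi''' there''"))
# <b>hello <i>hi</i></b><i> there</i>
```

This handles overlap in a single left-to-right pass, but it does nothing for the five-quote case, which still needs lookahead or a later fix-up.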
There are ways to deal with these... putting off the resolution until a later pass is the only way I know of that deals with the first one, and it's a bit touchy. Manageable, but touchy.
I think the easiest method (and the one closest to keeping it a single pass) is to use a DOM. It guarantees valid XML output every time, which I believe the MediaWiki parser doesn't always do.
It also makes it easy to go back and fix up the DOM tree if the parser has made a wrong initial choice. For example:
'''italics''
It might start out as <b>italics</b>, but on seeing '' it can be corrected to '<i>italics</i>.
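A rough sketch of that fix-up in Python, using the standard library's `xml.etree.ElementTree` as a stand-in for a DOM. The wrapping `<p>` element and the `fix_bold_to_italic` helper are assumptions made for illustration, not part of any existing parser:

```python
import xml.etree.ElementTree as ET

# Initial guess after reading ''' : the parser opens a bold element.
root = ET.Element("p")  # hypothetical container element
guess = ET.SubElement(root, "b")
guess.text = "italics"

def fix_bold_to_italic(parent, node):
    """Reinterpret ''' as a literal apostrophe followed by an italic opener."""
    node.tag = "i"
    # This sketch assumes `node` is the first child, so the literal
    # apostrophe belongs in the parent's leading text.
    parent.text = (parent.text or "") + "'"

# On seeing the closing '' the earlier choice is corrected in place.
fix_bold_to_italic(root, guess)

print(ET.tostring(root, encoding="unicode"))
# <p>'<i>italics</i></p>
```

Because the correction is an edit to the tree rather than to an output string, the serialized result stays well-formed whichever guess was made first.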
Jared
Keeping an abstract tree as an intermediate representation helps with this problem but does not fix it. Dealing with things like '''italics'' is non-trivial in any case: if we're going to retain this behavior, no context-free grammar (at least with fixed lookahead) can possibly suffice.
Whatever ends up handling this will have to run as a separate stage from the original parsing. What remains is the question of how many extra stages we will need.
- Eric