On 2/11/07, Eric Astor <eastor1@swarthmore.edu> wrote:
Just one example - probably of the 5% very hard category:
'''''hello''' hi'' vs. '''''hi'' hello'''
Rendered in HTML, the first reads <i><b>hello</b> hi</i>, and the second reads <b><i>hi</i> hello</b>. The problem is that the meaning of the first five quotes depends on the order in which the bold and italic regions close, and that order is not yet known at that point in a left-to-right scan.
This is where we could redefine the behavior slightly. Have ''''' always open <b><i>. Then, if the closing ''' comes before the closing '', output </i></b><i> at that point. On the other hand, from what you say next, I'm not sure that will help.
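To make the redefinition above concrete, here is a minimal single-pass sketch in Python. It is not MediaWiki's parser; the name render_quotes and the handling of leftover apostrophes and unclosed tags are assumptions made for illustration. It always opens ''''' as <b><i>, and whenever a tag closes that is not the innermost one, it closes the tags above it and reopens them afterwards.

```python
import re

# Runs of 5, 3 or 2 apostrophes; any other run length is only partly
# consumed and the rest is left as literal text (an assumption of this sketch).
TOKEN = re.compile(r"'{5}|'{3}|'{2}")
TAG = {"'''": "b", "''": "i"}


def render_quotes(text):
    """Hypothetical left-to-right renderer for bold/italic quote markup."""
    out = []
    stack = []  # currently open tags, outermost first

    def toggle(tag):
        if tag not in stack:
            stack.append(tag)
            out.append(f"<{tag}>")
            return
        # Close everything opened after `tag`, close `tag`, then reopen the rest.
        idx = stack.index(tag)
        above = stack[idx + 1:]
        for t in reversed(above):
            out.append(f"</{t}>")
        out.append(f"</{tag}>")
        del stack[idx:]
        for t in above:
            stack.append(t)
            out.append(f"<{t}>")

    pos = 0
    for m in TOKEN.finditer(text):
        out.append(text[pos:m.start()])
        pos = m.end()
        tok = m.group()
        if tok == "'''''":
            # Close whatever is open (innermost first), then open what is not.
            # With nothing open this yields <b><i>, as proposed.
            order = stack[::-1] + [t for t in ("b", "i") if t not in stack]
            for t in order:
                toggle(t)
        else:
            toggle(TAG[tok])
    out.append(text[pos:])
    for t in reversed(stack):  # close anything left open at end of input
        out.append(f"</{t}>")
    return "".join(out)


print(render_quotes("'''''hello''' hi''"))
print(render_quotes("'''''hi'' hello'''"))
```

For the first example this gives <b><i>hello</i></b><i> hi</i> rather than the <i><b>hello</b> hi</i> quoted above: the two should display the same, but the tag structure differs, which is the price of not looking ahead.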
Another example:
'''hello ''hi''' there''
MediaWiki renders this as <b>hello <i>hi</i></b><i> there</i>, properly handling overlapping formatting.
There are ways to deal with these... putting off the resolution until a later pass is the only way I know of that deals with the first one, and it's a bit touchy. Manageable, but touchy.
Well, we could just output invalid XML for both of these cases and then fix it in the Sanitizer/Tidy pass, I guess. The fix-up would have to work in some clearly defined manner, of course, perhaps stated informally in the grammar, or formally as a separate non-parsing algorithm.
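As a rough idea of what such a separate, non-parsing fix-up could look like, here is a small Python sketch. It is not what the Sanitizer or Tidy actually do; fix_nesting is a hypothetical name, and it only knows about bare <b> and <i> tags. It takes naively emitted, possibly misnested markup and repairs it by closing and reopening any tags that overlap.

```python
import re

TAG_RE = re.compile(r"<(/?)([bi])>")  # only bare <b>, </b>, <i>, </i>


def fix_nesting(markup):
    """Hypothetical repair pass: turn overlapping <b>/<i> into proper nesting."""
    out, stack, pos = [], [], 0
    for m in TAG_RE.finditer(markup):
        out.append(markup[pos:m.start()])
        pos = m.end()
        closing, tag = m.group(1) == "/", m.group(2)
        if not closing:
            stack.append(tag)
            out.append(f"<{tag}>")
        elif tag in stack:
            # Find the innermost open instance of this tag.
            idx = len(stack) - 1 - stack[::-1].index(tag)
            above = stack[idx + 1:]
            for t in reversed(above):   # close overlapping tags first
                out.append(f"</{t}>")
            out.append(f"</{tag}>")
            del stack[idx:]
            for t in above:             # then reopen them
                stack.append(t)
                out.append(f"<{t}>")
        # A closing tag with no matching opener is dropped (an assumption).
    out.append(markup[pos:])
    for t in reversed(stack):           # close anything still open
        out.append(f"</{t}>")
    return "".join(out)


print(fix_nesting("<b>hello <i>hi</b> there</i>"))
```

Fed the naive output for the second example, <b>hello <i>hi</b> there</i>, it produces <b>hello <i>hi</i></b><i> there</i>, the same as the MediaWiki rendering quoted above.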