Nick Jenkins wrote:
As one of the
many people who've done so, I agree. :) The problem is
that ~80% of wikimarkup is pretty straightforward to parse using
standard methods, another 10-15% can be done without huge difficulty
using known-but-less-standard methods, and the remaining 5% doesn't fit
well at all into any of the normal models of lexing/parsing.
[...snip...]
-Mark
Could I suggest giving some examples that you encountered of
the 10-15% hard category, and the 5% very hard category?
I ask so that if anyone feels tempted to start on defining the behaviour,
we can gently suggest doing the harder stuff *first* (with examples),
thus hopefully preventing the situation where we have multiple unfinished
80%-done definitions, and no 100%-complete formal definitions.
All the best,
Nick.
Just one example - probably of the 5% very hard category:
'''''hello''' hi''
vs.
'''''hi'' hello'''
Rendered in HTML, the first reads <i><b>hello</b> hi</i>, and the
second reads <b><i>hi</i> hello</b>. The problem is that the meaning
of the first 5 quotes changes based on the order in which the bold and
italic regions close - which is not determined while scanning
left-to-right.
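To make the deferral concrete, here's a toy resolver - emphatically not MediaWiki's actual algorithm, just a sketch - that decides what a 5-quote run means by looking ahead to whichever marker closes first:

```python
import re

def render_quotes(text):
    # Toy sketch (hypothetical, NOT MediaWiki's real quote handling):
    # '' toggles italic, ''' toggles bold, and a 5-quote run opening
    # both is resolved by deferring to the next quote marker seen.
    tokens = [t for t in re.split(r"('{2,5})", text) if t]
    out, stack = [], []           # stack: currently open tags, innermost last

    def toggle(tag):
        if tag in stack:
            stack.remove(tag)     # naive: assumes proper nesting (overlap breaks this)
            out.append("</%s>" % tag)
        else:
            stack.append(tag)
            out.append("<%s>" % tag)

    for i, tok in enumerate(tokens):
        if tok == "''":
            toggle("i")
        elif tok == "'''":
            toggle("b")
        elif tok == "'''''":
            # Deferred decision: whichever marker appears next closes
            # first, so it must be the *inner* tag of this 5-quote run.
            nxt = next((t for t in tokens[i + 1:]
                        if t in ("''", "'''")), "''")
            if nxt == "'''":
                toggle("i"); toggle("b")    # -> <i><b>
            else:
                toggle("b"); toggle("i")    # -> <b><i>
        else:
            out.append(tok)
    return "".join(out)
```

With that lookahead, both examples above come out as described, but note it is a single deferred decision rather than a full second pass, and the naive toggle falls over on genuinely overlapping regions.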
Another example:
'''hello ''hi''' there''
MediaWiki renders this as <b>hello <i>hi</i></b><i> there</i>,
properly handling overlapping formatting.
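For what it's worth, that kind of tag-splitting can be sketched by tracking open tags on a stack and closing/reopening across each overlap boundary. This is a hypothetical helper for illustration, not MediaWiki's code; it takes character ranges rather than wikitext:

```python
def emit_overlapping(text, ranges):
    # Toy sketch: turn possibly-overlapping (start, end, tag) character
    # ranges into well-formed HTML by closing and reopening tags at
    # each boundary, instead of emitting overlapping open/close pairs.
    events = {}
    for start, end, tag in ranges:
        events.setdefault(start, []).append(("open", tag))
        events.setdefault(end, []).append(("close", tag))
    out, stack = [], []
    for i, ch in enumerate(text + "\0"):       # sentinel flushes final closes
        for kind, tag in events.get(i, []):
            if kind == "open":
                stack.append(tag)
                out.append("<%s>" % tag)
            else:
                reopen = []
                while stack[-1] != tag:        # close intervening tags...
                    reopen.append(stack.pop())
                    out.append("</%s>" % reopen[-1])
                stack.pop()
                out.append("</%s>" % tag)
                for t in reversed(reopen):     # ...then reopen them
                    stack.append(t)
                    out.append("<%s>" % t)
        if ch != "\0":
            out.append(ch)
    return "".join(out)
```

Feeding it the plain text "hello hi there" with bold over "hello hi" and italic over "hi there" - i.e. emit_overlapping("hello hi there", [(0, 8, "b"), (6, 14, "i")]) - reproduces exactly the well-formed output quoted above.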
There are ways to deal with these... putting off the resolution until a
later pass is the only way I know of that deals with the first one, and
it's a bit touchy. Manageable, but touchy.
If I have time this summer, I'm going to look at formalizing a parser
again... see if I can make a start on hammering out a somewhat more
formal structure for handling some of the tougher cases, after I've
tackled OLPC's CrossMark.
- Eric Astor