On 11/13/07, Steve Bennett <stevagewp(a)gmail.com> wrote:
What's the best way to approach parsing a long string of formatted text:
1) Treat each incidence of ''' or '' as an element to be translated into
<b>, <i>, </b>, or </i>, using state ("context"?) to determine which
2) Have a rule that treats an entire run of '''........''' as a single
element, to be transformed into <b>.......</b>.
To answer my own question, I don't think 2) is possible, due to the
legitimacy of constructs like:
Here is some ''italics with a [[link|that switches ''off]] the italics.
I think '' and ''' will have to be parsed as rather ambiguous "toggle state
of bold/italics" tokens, whose meaning can be made clearer by walking the
AST afterwards.
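
A rough sketch of what I mean, in Python. Everything here is hypothetical
(the names tokenize and resolve, the token kinds, the flat token list standing
in for a real AST), and it ignores links, runs of four or more apostrophes, and
mis-nested bold/italics. It only shows the two steps: emit neutral toggle
tokens first, then decide open versus close in a later pass.

```python
import re
from typing import List, Tuple

# (kind, text); kind is "TEXT", "ITALIC_TOGGLE" or "BOLD_TOGGLE".
Token = Tuple[str, str]

# Assumption: only runs of exactly two or three apostrophes are markup.
QUOTE_RUN = re.compile(r"'{2,3}")


def tokenize(src: str) -> List[Token]:
    """Split the input into text and context-free toggle tokens."""
    tokens: List[Token] = []
    pos = 0
    for m in QUOTE_RUN.finditer(src):
        if m.start() > pos:
            tokens.append(("TEXT", src[pos:m.start()]))
        kind = "BOLD_TOGGLE" if len(m.group()) == 3 else "ITALIC_TOGGLE"
        tokens.append((kind, m.group()))
        pos = m.end()
    if pos < len(src):
        tokens.append(("TEXT", src[pos:]))
    return tokens


def resolve(tokens: List[Token]) -> str:
    """Second pass: use running state to turn each toggle into an
    opening or closing tag."""
    tags = {"ITALIC_TOGGLE": "i", "BOLD_TOGGLE": "b"}
    is_open = {"ITALIC_TOGGLE": False, "BOLD_TOGGLE": False}
    out: List[str] = []
    for kind, text in tokens:
        if kind == "TEXT":
            out.append(text)
            continue
        is_open[kind] = not is_open[kind]
        out.append(f"<{tags[kind]}>" if is_open[kind] else f"</{tags[kind]}>")
    return "".join(out)
```

The tokenizer never needs to know whether a given '' opens or closes
anything. That decision lives entirely in the second pass, which is where a
real implementation would also have to deal with the link boundary.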
It's a pity, because the existing work on the EBNF assumed that they could
be treated as blocks.
http://www.mediawiki.org/wiki/Markup_spec (was at meta)
Unless someone wants to jump in and claim that the above construct is a
mistake and that ''..'' *should* be a block of some kind.
Steve
PS
http://www.usemod.com/cgi-bin/mb.pl?ConsumeParseRenderVsMatchTransform is
useful for describing the parser transformation we're trying to achieve.
Apparently we're trying to convert a "match-transform" parser into a
"consume-parse-render" parser.
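
For illustration only, here is roughly what I understand by "match-transform":
successive regex substitutions over the raw text, with no intermediate
structure. The function name match_transform and the two patterns are my own
simplification in Python, not MediaWiki's actual code.

```python
import re

# Assumption: bold is handled before italics so that ''' is not
# consumed as '' plus a stray apostrophe.
BOLD = re.compile(r"'''(.*?)'''")
ITALIC = re.compile(r"''(.*?)''")


def match_transform(text: str) -> str:
    """Rewrite the string in place, one pattern at a time."""
    text = BOLD.sub(r"<b>\1</b>", text)
    text = ITALIC.sub(r"<i>\1</i>", text)
    return text
```

Run on the link example earlier in this message, the italic pattern pairs
the first '' with the one inside the link text, so the closing </i> lands
inside [[...]]. A string rewriter does not care about that; a parser that has
to build a tree does.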