On 11/13/07, Steve Bennett <stevagewp(a)gmail.com> wrote:
What's the best way to approach parsing a long string of formatted text:
1) Treat each incidence of ''' or '' as an element to be translated into
   <b>, <i>, </b>, or </i>, using state ("context"?) to determine which
2) Have a rule that treats an entire run of '''........''' as a single
   element, to be transformed into <b>.......</b>.
To answer my own question, I don't think 2) is possible, due to the
legitimacy of constructs like:
Here is some ''italics with a [[link|that switches ''off]] the italics.
I think '' and ''' will have to be parsed as rather ambiguous "toggle
state of bold/italics" tokens, whose meaning can be made more clear by
walking the AST afterwards.
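
A minimal sketch of the "ambiguous toggle token" idea, in Python. The token
names and the functions tokenize() and resolve() are hypothetical and are not
taken from the MediaWiki code or the EBNF draft. It only knows about '' and
''' (no ''''' handling, and links are left as plain text), and the second
pass walks a flat token list rather than a real AST:

```python
import re

# Hypothetical token kinds: 'bold' for ''', 'italic' for '', 'text' for the rest.
# A lone apostrophe (as in "it's") is treated as ordinary text.
TOKEN_RE = re.compile(r"(?P<bold>''')|(?P<italic>'')|(?P<text>(?:[^']|'(?!'))+)")


def tokenize(line):
    """First pass: emit toggle tokens without deciding open vs. close."""
    return [(m.lastgroup, m.group()) for m in TOKEN_RE.finditer(line)]


def resolve(tokens):
    """Second pass: use state to turn each toggle into an opening or closing tag."""
    tags = {"bold": "b", "italic": "i"}
    is_open = {"bold": False, "italic": False}
    out = []
    for kind, value in tokens:
        if kind == "text":
            out.append(value)
            continue
        out.append("</%s>" % tags[kind] if is_open[kind] else "<%s>" % tags[kind])
        is_open[kind] = not is_open[kind]
    return "".join(out)


sample = "Here is some ''italics with a [[link|that switches ''off]] the italics."
print(resolve(tokenize(sample)))
```

With the sample above, the second '' is resolved as a closing tag even though
it sits inside the link text, which is exactly why a "block" rule for ''..''
does not fit. Note that this sketch makes no attempt to keep <b> and <i>
properly nested.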
It's a pity, because the existing work on the EBNF assumed that they could
be treated as blocks.
http://www.mediawiki.org/wiki/Markup_spec (was at meta)
Unless someone wants to jump in and claim that the above construct is a
mistake and that ''..'' *should* be a block of some kind.
Steve
PS: http://www.usemod.com/cgi-bin/mb.pl?ConsumeParseRenderVsMatchTransform
is useful for describing the parser transformation we're trying to achieve.
Apparently we're trying to convert a "match-transform" parser into a
"consume-parse-render" parser.
You can't treat it as a toggle unless you know what you're toggling, which
depends on matching open/close delimiters over an entire paragraph, since
the end of the paragraph implicitly ends any bold/italic.
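
As a rough illustration of paragraph-scoped matching, here is a Python sketch
in which every paragraph starts with nothing open and anything still open at
the end of the paragraph is closed implicitly. The names render_paragraph()
and render() are made up for this example, paragraphs are assumed to be
separated by a blank line, and this does not reproduce MediaWiki's actual
apostrophe handling:

```python
import re


def render_paragraph(paragraph):
    """Resolve '' and ''' inside one paragraph, closing leftovers at its end."""
    tags = {"''": "i", "'''": "b"}
    open_tags = []  # stack of open tag names, innermost last
    out = []
    # The capturing group makes re.split keep the delimiters in the result.
    for piece in re.split(r"('''|'')", paragraph):
        tag = tags.get(piece)
        if tag is None:
            out.append(piece)  # ordinary text
        elif tag in open_tags:
            # Close down to the matching tag, then reopen what was nested inside it.
            reopen = []
            while True:
                top = open_tags.pop()
                out.append("</%s>" % top)
                if top == tag:
                    break
                reopen.append(top)
            for name in reversed(reopen):
                out.append("<%s>" % name)
                open_tags.append(name)
        else:
            out.append("<%s>" % tag)
            open_tags.append(tag)
    # End of paragraph: implicitly close whatever is still open.
    while open_tags:
        out.append("</%s>" % open_tags.pop())
    return "".join(out)


def render(text):
    """Apply render_paragraph to each blank-line-separated paragraph."""
    return "\n\n".join(render_paragraph(p) for p in text.split("\n\n"))
```

For example, render("''one\n\ntwo") closes the italics at the end of the first
paragraph, so the second paragraph is unaffected.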
The behavior you're seeing is an artifact of the multi-stage parsing.
That sort of thing really should go if we want to migrate to a recursive
descent parser.
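
To show how a multi-stage ("match-transform") pipeline can produce this kind
of artifact, here is a small Python sketch with two independent regex passes.
The patterns, the stage order, and the function name match_transform() are
assumptions chosen for illustration; they are not the real MediaWiki stages:

```python
import re


def match_transform(text):
    """Two independent substitution passes, neither aware of the other."""
    # Stage 1: pair up '' ... '' and rewrite as italics.
    text = re.sub(r"''(.*?)''", r"<i>\1</i>", text)
    # Stage 2: rewrite piped links [[target|label]] as anchors.
    text = re.sub(r"\[\[([^\]|]+)\|([^\]]+)\]\]", r'<a href="\1">\2</a>', text)
    return text


sample = "Here is some ''italics with a [[link|that switches ''off]] the italics."
print(match_transform(sample))
```

The italics pass closes <i> in the middle of what the link pass later wraps in
<a>, so the two elements end up overlapping instead of nesting. A parser that
builds a single tree has to decide explicitly what such input means.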