On Wed, 14 Nov 2007 00:27:51 +1100, Steve Bennett wrote:
On 11/13/07, Steve Bennett <stevagewp@gmail.com> wrote:
What's the best way to approach parsing a long string of formatted text:
1) Treat each incidence of ''' or '' as an element to be translated into <b>, <i>, </b>, or </i>, using state ("context"?) to determine which
2) Have a rule that treats an entire run of '''........''' as a single element, to be transformed into <b>.......</b>.
To answer my own question, I don't think 2) is possible, due to the legitimacy of constructs like:
Here is some ''italics with a [[link|that switches ''off]] the italics.
I think '' and ''' will have to be parsed as rather ambiguous "toggle state of bold/italics" tokens, whose meaning can be made more clear by walking the AST afterwards.
It's a pity, because the existing work on the EBNF assumed that they could be treated as blocks. http://www.mediawiki.org/wiki/Markup_spec (was at meta)
Unless someone wants to jump in and claim that the above construct is a mistake and that ''..'' *should* be a block of some kind.
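A minimal sketch of the "toggle token" idea in Python, to make the proposal concrete. The token names (TEXT, ITALIC_TOGGLE, BOLD_TOGGLE) and the tokenize function are hypothetical and are not taken from the Markup spec or the MediaWiki source. Runs of four or five apostrophes and the link syntax are deliberately ignored here.

```python
import re
from typing import List, Tuple

# (kind, text) pairs; the kind names are illustrative assumptions only.
Token = Tuple[str, str]

# Try ''' before '' so that a run of three is not read as two plus one.
_QUOTE_RUN = re.compile(r"'{3}|'{2}")


def tokenize(text: str) -> List[Token]:
    """Split text into plain TEXT tokens and ambiguous toggle tokens.

    No attempt is made to decide whether a toggle opens or closes
    anything; that is left to a later pass over the token stream.
    """
    tokens: List[Token] = []
    pos = 0
    for match in _QUOTE_RUN.finditer(text):
        if match.start() > pos:
            tokens.append(("TEXT", text[pos:match.start()]))
        kind = "BOLD_TOGGLE" if len(match.group()) == 3 else "ITALIC_TOGGLE"
        tokens.append((kind, match.group()))
        pos = match.end()
    if pos < len(text):
        tokens.append(("TEXT", text[pos:]))
    return tokens
```

Run on the link example above, this yields two ITALIC_TOGGLE tokens, the second of which sits inside the link text. The tokenizer itself cannot say which one opens and which one closes, which is exactly the ambiguity in question.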
Steve
PS http://www.usemod.com/cgi-bin/mb.pl?ConsumeParseRenderVsMatchTransform is useful for describing the parser transformation we're trying to achieve. Apparently we're trying to convert a "match-transform" parser into a "consume-parse-render" parser.
You can't treat it as a toggle unless you know what you're toggling, which depends on matching open/close delimiters over an entire paragraph, since the end of the paragraph implicitly ends any bold/italic.
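A rough Python sketch of what paragraph-scoped matching could look like, assuming the paragraph has already been isolated from the surrounding text. The render_paragraph function is hypothetical and is not the existing parser's algorithm: it ignores links and runs of four or five apostrophes, and it simply pairs each delimiter with the most recent open one of the same kind.

```python
import re

# The capturing group makes re.split keep the delimiters in its result.
_QUOTE_RUN = re.compile(r"('{3}|'{2})")
_TAG = {"''": "i", "'''": "b"}


def render_paragraph(paragraph: str) -> str:
    """Resolve '' and ''' within one paragraph into <i>/<b> tags."""
    out = []
    open_tags = []  # stack of open tag names, innermost last
    for piece in _QUOTE_RUN.split(paragraph):
        tag = _TAG.get(piece)
        if tag is None:
            out.append(piece)  # ordinary text (possibly empty)
        elif tag in open_tags:
            # Close down to the matching tag, then reopen whatever was
            # closed on the way so the output stays properly nested.
            reopened = []
            while True:
                top = open_tags.pop()
                out.append(f"</{top}>")
                if top == tag:
                    break
                reopened.append(top)
            for name in reversed(reopened):
                out.append(f"<{name}>")
                open_tags.append(name)
        else:
            out.append(f"<{tag}>")
            open_tags.append(tag)
    # The end of the paragraph implicitly ends any bold/italic.
    while open_tags:
        out.append(f"</{open_tags.pop()}>")
    return "".join(out)
```

For example, render_paragraph("some ''unclosed italics") gives "some <i>unclosed italics</i>": whether a given '' opens or closes is only known once the whole paragraph has been scanned.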
The behavior you're seeing is an artifact of the multi-stage parsing. That sort of thing really should go if we want to migrate to a recursive descent parser.