On Wed, Nov 14, 2007 at 12:27:51AM +1100, Steve Bennett wrote:
On 11/13/07, Steve Bennett <stevagewp@gmail.com> wrote:
What's the best way to approach parsing a long string of formatted text:
1) Treat each incidence of ''' or '' as an element to be translated into <b>, <i>, </b>, or </i>, using state ("context"?) to determine which
2) Have a rule that treats an entire run of '''........''' as a single element, to be transformed into <b>.......</b>.
To answer my own question, I don't think 2) is possible, due to the legitimacy of constructs like:
Here is some ''italics with a [[link|that switches ''off]] the italics.
I think '' and ''' will have to be parsed as rather ambiguous "toggle state of bold/italics" tokens, whose meaning can be made clearer by walking the AST afterwards.
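As a rough illustration of the "toggle token" idea, here is a minimal Python sketch. It is not how the existing MediaWiki parser works; the token names (toggle_bold, toggle_italic) and the functions tokenize() and resolve() are made up for this example. It emits '' and ''' as context-free toggle tokens, then a second pass uses simple on/off state to decide whether each toggle opens or closes a tag.

```python
import re

# Hypothetical token names, invented for this sketch only.
# ''' is tried before '' so a run of three quotes is read as bold.
TOGGLE = re.compile(r"('{3}|'{2})")
TAGS = {"toggle_bold": "b", "toggle_italic": "i"}


def tokenize(text):
    """Split text into ('text', s), ('toggle_bold', s) and ('toggle_italic', s)."""
    tokens = []
    # re.split keeps the separators because the pattern has a capturing group.
    for part in TOGGLE.split(text):
        if part == "'''":
            tokens.append(("toggle_bold", part))
        elif part == "''":
            tokens.append(("toggle_italic", part))
        elif part:
            tokens.append(("text", part))
    return tokens


def resolve(tokens):
    """Second pass: turn each toggle into an opening or closing tag."""
    is_open = {"toggle_bold": False, "toggle_italic": False}
    out = []
    for kind, value in tokens:
        if kind == "text":
            out.append(value)
            continue
        tag = TAGS[kind]
        out.append("</%s>" % tag if is_open[kind] else "<%s>" % tag)
        is_open[kind] = not is_open[kind]
    return "".join(out)


line = "Here is some ''italics with a [[link|that switches ''off]] the italics."
print(resolve(tokenize(line)))
```

The tokenizer never needs to know where a bold or italic run ends, so the link example above goes through without any block rule. The sketch deliberately ignores the hard parts: it leaves the [[...]] link untouched, does not close tags left open at the end of a line, and can emit badly nested output for input like ''a'''b''c'''.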
It's a pity, because the existing work on the EBNF assumed that they could be treated as blocks. http://www.mediawiki.org/wiki/Markup_spec (was at meta)
Unless someone wants to jump in and claim that the above construct is a mistake and that ''..'' *should* be a block of some kind.
Right here, Steve, you're hitting on the underlying problem with this project: some behavior of the current parser is defined and intentional, and some of it is probably an accident of the implementation.
Distinguishing these is probably a) important and b) impossible.
Cheers,
-- jra