Parsing italics/bold - Wikitech-l

13 Nov 2007

What's the best way to approach parsing a long string of formatted text:

1) Treat each occurrence of ''' or '' as an element to be translated into
<b>, </b>, <i>, or </i>, using state ("context"?) to determine which
2) Have a rule that treats an entire run of '''........''' as a single
element, to be transformed into <b>........</b>

I'm not even considering the much-discussed ambiguities of apostrophes.
Assuming simple, possibly well-formed but at least not pathological input,
which way is best?
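
For illustration, option 1 might look roughly like the Python sketch below. It is a minimal sketch, not the existing parser: the function name toggle_markup is hypothetical, and using <b>/<i> as the output tags is an assumption. Each marker is handled on its own, and a small state table decides whether it opens or closes.

```python
import re

# Longest alternative first, so ''' is not read as '' followed by a stray '.
TOKEN = re.compile(r"'''|''")

def toggle_markup(text):
    """Translate each ''' or '' into an open or close tag, based on state."""
    state = {"b": False, "i": False}  # which styles are currently open
    out = []
    pos = 0
    for m in TOKEN.finditer(text):
        out.append(text[pos:m.start()])          # plain text before the marker
        tag = "b" if m.group() == "'''" else "i"
        out.append(f"</{tag}>" if state[tag] else f"<{tag}>")
        state[tag] = not state[tag]              # flip open/closed
        pos = m.end()
    out.append(text[pos:])                       # trailing plain text
    return "".join(out)

print(toggle_markup("a '''bold''' and ''it'' b"))
print(toggle_markup("'''a ''b''' c''"))  # overlapping markers
```

With overlapping input such as '''a ''b''' c'' this emits the tags in the order the markers arrive, so the output is mis-nested and it is left to whatever consumes it to cope. An unclosed ''' simply leaves a tag open.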

A lot of our assumptions about how to parse come from parsing programming
languages, but I can't think of an analogous programming language feature:
''' doesn't nest, so it's not like an if-block, and its contents have to be
parsed, so it's not like a comment. At best it seems vaguely like an inline
compiler flag, a #define/#undef in C, or an OPTION BASE statement in VB,
all of which clearly change state and don't require block terminators.

The downside of 1) is it seems to tie us to HTML, and rely on this external
entity (the browser) to make sense of the begin/end tokens we spit out. It
also requires keeping track of state...
The downside of 2) is it seems difficult to fail gracefully if there is no
closing token or if overlapping bold/italics are found. At best, a section
of text might have to be parsed twice. At worst, it will be much more
pedantic than our current parser, and will ignore improper bold/italics
altogether.
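
For comparison, option 2 might be sketched as below, again in Python and again only as an illustration under assumptions: convert_runs is a hypothetical name, <b>/<i> are assumed output tags, and a regular expression stands in for whatever rule a real grammar would use. A run is only transformed when both its opening and closing markers are found; its contents are then processed by a recursive call.

```python
import re

# A whole run: ''' ... ''' or '' ... ''. The lazy .+? stops at the first
# closing marker, and `.` does not match newlines, so runs stay on one line.
RUN = re.compile(r"'''(.+?)'''|''(.+?)''")

def convert_runs(text):
    """Transform complete bold/italic runs; leave unmatched markers as-is."""
    def repl(m):
        # The contents were already scanned once to find the closing marker;
        # the recursive call scans them a second time for inner runs.
        if m.group(1) is not None:
            return "<b>" + convert_runs(m.group(1)) + "</b>"
        return "<i>" + convert_runs(m.group(2)) + "</i>"
    return RUN.sub(repl, text)

print(convert_runs("'''bold ''both'' bold'''"))
print(convert_runs("'''a ''b''' c''"))  # overlapping markers
print(convert_runs("'''abc"))           # no closing marker
```

In this sketch the output is always properly nested, but the overlapping case keeps only the bold run and leaves the '' markers in the text literally, and the unclosed ''' is left untouched.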

Suggestions?

Steve