One issue I've been having has to do with high-level punctuation
getting tangled up in embedded text. In wikitext, it's generally OK to
write a literal ]] - it just means two right square brackets in a row.
But of course inside an image element such as
[[image:foo.jpg|caption]], a ]] means the end of the image element,
not just raw text.

I can see, and have sort of tried, three ways to handle this:

1) Using traditional grammar approaches, backtracking and so forth,
hoping the parser is smart enough to match the right string, and "pull
back" at the right moment. Unfortunately, this seems very difficult
without an extremely good knowledge of the compiler-compiler, and is
probably slow to boot.

2) Using bottom up* context flags like "inside image element", so when
a "]]" is found, we know whether or not we can treat it as a
literal. Problem: you end up smearing knowledge about the image
element everywhere: why should the RIGHT_SQUARE_BRACKET literal want
to know anything about image elements?

3) Using top down restrictions on literals like "prohibit literal
double right square bracket". Similar to 2), but when a "]]" is found
it just dumbly looks at the corresponding flag to decide whether to
match it as a literal. The flag is set by whichever enclosing rule
needs the restriction, so the literal never has to know who set it.

Method 3 seems the most promising now. I was using 2), but it seemed
to become very complex all of a sudden.

I now have code that looks like this:

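    // prohibit_* are counters (assumed to be int fields declared in @members)
    image_caption
    @init {prohibit_literal_link_end++; prohibit_literal_pipe++;}
        : inline_text?
          -> ^(TEXT inline_text?)  // '?' so an empty caption doesn't break the rewrite
        ;
    finally {prohibit_literal_link_end--; prohibit_literal_pipe--;}

    ...

    // Gated predicate: only viable when no enclosing rule prohibits it.
    literal_link_end
        : {prohibit_literal_link_end <= 0}?=> link_end
        ;
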
This seems to be relatively readable too: "An image caption is any
text, except that there can't be an unescaped literal pipe or link_end
(]]) in it." and "A literal link end is whenever you encounter a raw
link_end, unless someone has said you can't."
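
For anyone who wants to play with the idea outside ANTLR, here is a
minimal sketch in Python. It is not the grammar above, just a toy
hand-written recursive-descent parser for a tiny made-up subset of
wikitext (plain text plus [[image:NAME|CAPTION]]). The class name
MiniWikiParser, the tuple node shapes and the error handling are all
assumptions made for illustration; only the two counter names are
borrowed from the grammar. The try/finally plays the role of
@init/finally.

```python
class MiniWikiParser:
    """Toy parser showing top-down "prohibit literal" counters (hypothetical)."""

    def __init__(self, text):
        self.text = text
        self.pos = 0
        # Counters rather than booleans, so nested restrictions stack.
        self.prohibit_literal_link_end = 0
        self.prohibit_literal_pipe = 0

    def parse(self):
        return self.inline_text()

    def inline_text(self):
        nodes, buf = [], []

        def flush():
            if buf:
                nodes.append(("TEXT", "".join(buf)))
                buf.clear()

        while self.pos < len(self.text):
            if self.text.startswith("[[image:", self.pos):
                flush()
                nodes.append(self.image())
            elif self.text.startswith("]]", self.pos):
                # The literal only consults the flag; it knows nothing
                # about images or whoever set the restriction.
                if self.prohibit_literal_link_end > 0:
                    break
                buf.append("]]")
                self.pos += 2
            elif self.text[self.pos] == "|" and self.prohibit_literal_pipe > 0:
                break
            else:
                buf.append(self.text[self.pos])
                self.pos += 1
        flush()
        return nodes

    def image(self):
        self.pos += len("[[image:")
        start = self.pos
        # The file name runs up to the first "|" or "]]".
        while (self.pos < len(self.text)
               and self.text[self.pos] != "|"
               and not self.text.startswith("]]", self.pos)):
            self.pos += 1
        name = self.text[start:self.pos]
        caption = []
        if self.text.startswith("|", self.pos):
            self.pos += 1
            caption = self.image_caption()
        if not self.text.startswith("]]", self.pos):
            raise ValueError("unterminated image element")
        self.pos += 2
        return ("IMAGE", name, caption)

    def image_caption(self):
        # Same shape as the @init / finally pair in the grammar.
        self.prohibit_literal_link_end += 1
        self.prohibit_literal_pipe += 1
        try:
            return self.inline_text()
        finally:
            self.prohibit_literal_link_end -= 1
            self.prohibit_literal_pipe -= 1
```

Parsing "a ]] b" with this sketch keeps the ]] as ordinary text, while
in "x [[image:foo.jpg|cap]] y ]]" the first ]] closes the image and
the second one is a literal again, because the counter is back at zero
by then.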

Seems to keep me a bit saner, too. Anyway, just thought I would share.

Steve

* I'm using the terms 'bottom up' and 'top down' extremely loosely here.