One issue I've been having has to do with high-level punctuation getting tangled up in embedded text. In wikitext, it's generally OK to write a literal ]]: it just means two right square brackets in a row. But of course inside an image element such as [[image:foo.jpg|caption]], a ]] means the end of the image element, not just raw text.
I can see, and have sort of tried, three ways to handle this:
1) Using traditional grammar approaches, backtracking and so forth, hoping the parser is smart enough to match the right string, and "pull back" at the right moment. Unfortunately, this seems very difficult without an extremely good knowledge of the compiler compiler, and is probably slow to boot.
2) Using bottom-up* context flags like "inside image element", so when a "]]" is found, we know whether or not we can treat it as a literal. Problem: you end up smearing knowledge about the image element everywhere: why does the RIGHT_SQUARE_BRACKET literal want to know anything about image elements?
3) Using top-down restrictions on literals like "prohibit literal double right square bracket". Similar to 2), but when a "]]" is found, the literal rule just dumbly looks at the corresponding flag to decide whether to match it as a literal. It doesn't need to know who set the flag or why.
Method 3 seems the most promising now. I was using 2), but it seemed to become very complex all of a sudden.
I now have code that looks like this:
image_caption
@init {prohibit_literal_link_end++; prohibit_literal_pipe++;}
    : inline_text? -> ^(TEXT inline_text)
    ;
    finally {prohibit_literal_link_end--; prohibit_literal_pipe--;}
...
literal_link_end
    : {prohibit_literal_link_end <= 0}? => link_end
    ;
This seems to be relatively readable too: "An image caption is any text, except that there can't be an unescaped literal pipe or link_end (]]) in it." and "A literal link end is whenever you encounter a raw link_end, unless someone has said you can't."
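For anyone who wants to play with the idea outside a parser generator, here is a rough hand-written sketch of the same counter trick in Python. It is not the grammar above and not how ANTLR implements predicates; the class CaptionParser, the prohibiting() helper, and parse_caption() are names I made up for illustration. The counters are integers rather than booleans so that nested rules can each bump and restore them independently.

```python
from contextlib import contextmanager


class CaptionParser:
    """Hypothetical hand-rolled analogue of the counter-guarded rules."""

    def __init__(self, text):
        self.text = text
        self.pos = 0
        # Counters, not booleans: nested rules can each increment and
        # decrement without clobbering an enclosing rule's prohibition.
        self.prohibit_literal_link_end = 0
        self.prohibit_literal_pipe = 0

    @contextmanager
    def prohibiting(self, *flags):
        # Plays the role of "@init {flag++;} ... finally {flag--;}".
        for name in flags:
            setattr(self, name, getattr(self, name) + 1)
        try:
            yield
        finally:
            for name in flags:
                setattr(self, name, getattr(self, name) - 1)

    def inline_text(self):
        # Consume text, treating "]]" and "|" as literals only when
        # nobody higher up has prohibited them.
        start = self.pos
        while self.pos < len(self.text):
            if self.text.startswith("]]", self.pos):
                if self.prohibit_literal_link_end > 0:
                    break  # leave "]]" for the enclosing rule
                self.pos += 2
            elif self.text[self.pos] == "|":
                if self.prohibit_literal_pipe > 0:
                    break  # leave "|" for the enclosing rule
                self.pos += 1
            else:
                self.pos += 1
        return self.text[start:self.pos]

    def image_caption(self):
        # "An image caption is any text, except that there can't be an
        # unescaped literal pipe or link end in it."
        with self.prohibiting("prohibit_literal_link_end",
                              "prohibit_literal_pipe"):
            return self.inline_text()


def parse_caption(text):
    """Return (caption, unconsumed remainder) for a caption body."""
    parser = CaptionParser(text)
    caption = parser.image_caption()
    return caption, text[parser.pos:]
```

Calling parse_caption("a caption]] tail") stops at the "]]" and hands it back to the caller, while CaptionParser("a ]] b").inline_text() at the top level swallows the brackets as plain text. Note that inline_text() itself never mentions images; it only consults the counters.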
Seems to keep me a bit saner, too. Anyway, just thought I would share.
Steve

* I'm using the terms 'bottom up' and 'top down' extremely loosely here.
wikitext-l@lists.wikimedia.org