Here's another one, at the bottom of

http://www.mediawiki.org/wiki/User:Stevage
(Note: mw_img_thumbnail means "the magic word 'img_thumbnail', however that is defined".)

The problem I have here is the options for the image: you'd like the word "thumbnail" to be a token, but then if you get a case like:

 [[image:finger.jpg|Note the impressive thumbnails.]] 

the lexer emits a single "thumbnail" token in the middle of the caption, rather than treating "t", "h", etc. as ordinary text.
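
To make the problem concrete, here is a rough sketch in Python rather than ANTLR. The rule names (MW_IMG_THUMBNAIL, PIPE, CHAR) and the naive_tokens helper are made up for illustration; the only assumption is that the magic-word rule is tried before the catch-all text rule, which is roughly what happens when both can match at the same position.

```python
import re

# Hypothetical token rules. The magic word is listed first, so it wins
# wherever the string "thumbnail" appears, even inside a longer word.
TOKEN_RE = re.compile(
    r"(?P<MW_IMG_THUMBNAIL>thumbnail)|(?P<PIPE>\|)|(?P<CHAR>.)",
    re.DOTALL,
)

def naive_tokens(text):
    """Return (rule_name, matched_text) pairs for the whole input."""
    return [(m.lastgroup, m.group()) for m in TOKEN_RE.finditer(text)]

tokens = naive_tokens("Note the impressive thumbnails.")
# The caption is split as: 20 CHAR tokens, one MW_IMG_THUMBNAIL token,
# then CHAR 's' and CHAR '.'.
```

The caption never contains an image option, yet a MW_IMG_THUMBNAIL token shows up in the middle of it, so the parser has to be told that this token is also acceptable as plain text.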

Solutions I can think of so far:
1) Explicitly define the text rule as 'a'..'z' | 'A'..'Z' | MW_img_thumbnail | ..., so that the magic word is also accepted as ordinary text.
2) Make tokens for individual letters (Aa, Bb, ...), then have the parser recognise a sequence like Tt + Hh + Uu + Mm...
3) Make a token for '|thumbnail', then use some trick to distinguish '|thumbnailblah' from '|thumbnail|'.
4) Like 1), but use a localised lexer so that those words are tokens only in this specific context.
5) Just match text, then use special markup at the parser level to look into the text that was matched.

I've tried 1) and 2) and they both work. I'll probably try 5) next because 3) is just ugly.
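
For what it's worth, here is roughly what I have in mind for 5), again sketched in Python rather than ANTLR. This is only an illustration of the idea, not a proposed implementation: parse_image_link and IMAGE_OPTIONS are hypothetical names, the option list is a placeholder for however the magic words end up being defined, and it ignores captions that themselves contain links with pipes.

```python
# Illustrative set only; the real magic words are localisable and would
# come from wherever they are actually defined.
IMAGE_OPTIONS = {"thumbnail", "thumb", "frame", "left", "right"}

def parse_image_link(link):
    """Split '[[image:name|field|...]]' into (name, options, caption).

    Every field is first matched as plain text. Only afterwards do we
    look inside it: a field counts as an option only if the WHOLE field
    is a magic word, so 'thumbnails' in a caption stays ordinary text.
    """
    if not (link.startswith("[[") and link.endswith("]]")):
        raise ValueError("not a wiki link: %r" % link)
    target, *fields = link[2:-2].split("|")
    namespace, _, name = target.partition(":")
    if namespace.lower() != "image":
        raise ValueError("not an image link: %r" % link)

    options, caption = [], None
    for field in fields:
        word = field.strip()
        if word in IMAGE_OPTIONS:
            options.append(word)
        else:
            caption = field  # the last non-option field wins as caption
    return name, options, caption

plain = parse_image_link("[[image:finger.jpg|Note the impressive thumbnails.]]")
thumb = parse_image_link("[[image:finger.jpg|thumbnail|A finger]]")
```

Here plain comes back with no options and the full caption, while thumb comes back with "thumbnail" as an option and "A finger" as the caption. The lexer never needs a "thumbnail" token at all; the cost is that the grammar itself no longer says what the options are.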

Anyone have any comments or suggestions?

I really think writing the grammar in ANTLR is our best bet at this point. Advantages:
1) We're talking about actual, parseable grammar in an actual syntax, rather than the half-arsed EBNF/BNF we've done so far.
2) We can use ANTLRWorks to play with the grammar, visualise it etc.
3) One of the goals is to let third parties generate parser code in a variety of languages. ANTLR already has 5 code targets and more (perhaps including PHP) are on the way.

Downsides:
1) ANTLR can't yet generate a parser in PHP. However, there may exist Java->PHP or C->PHP translators or something.

Steve