On 11/18/07, Steve Bennett stevagewp@gmail.com wrote:
Solutions I can think of so far:
1) Explicitly make the match for text to be 'a'..'z' | 'A'..'Z' |
MW_img_thumbnail | ...
2) Make tokens for individual letters (Aa, Bb...) then make the parser
recognise a pattern like Tt + Hh + Uu + Mm...
3) Make a token which is '|thumbnail', then use some
trick to distinguish '|thumbnailblah' from '|thumbnail|'.
4) Like 1), but use a localised lexer so that those words are only tokens
in this specific context.
5) Just match text, then use special markup at the parser level to look
into the text that was matched.
Omg it's so much easier than that. 6) Use a syntactic predicate:
option : (magicword '|') => magicword | caption;
magicword : 'magicword';
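Spoken with blissful ignorance. It turns out that solution doesn't work. Incorporating a literal string in the parser creates a new lexer token, so the word is lexed as that token everywhere it appears, not just in option position. That means this type of thing doesn't parse, because the "thumbnail" in the file name is no longer ordinary text: [[Image:thumbnail.jpg]].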
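To make the failure concrete, here is a rough Python model of a lexer that has picked up 'thumbnail' as its own token. It is not ANTLR and the real lexer's matching rules are more involved; the token names and the lex function are made up for illustration. It only shows how the file name gets split once the literal exists as a token.

```python
import re

# Toy lexer: the literal 'thumbnail' has its own token type (MAGICWORD),
# tried before the catch-all single-character rule. Names are hypothetical.
TOKEN_RE = re.compile(
    r"(?P<LINK_START>\[\[)"
    r"|(?P<LINK_END>\]\])"
    r"|(?P<PIPE>\|)"
    r"|(?P<MAGICWORD>thumbnail)"
    r"|(?P<CHAR>.)",
    re.DOTALL,
)

def lex(source):
    """Return (type, text) pairs, merging consecutive characters into TEXT."""
    tokens = []
    for m in TOKEN_RE.finditer(source):
        kind, text = m.lastgroup, m.group()
        if kind == "CHAR":
            if tokens and tokens[-1][0] == "TEXT":
                tokens[-1] = ("TEXT", tokens[-1][1] + text)
            else:
                tokens.append(("TEXT", text))
        else:
            tokens.append((kind, text))
    return tokens

for token in lex("[[Image:thumbnail.jpg]]"):
    print(token)
```

The file name comes out as three tokens (TEXT, MAGICWORD, TEXT) instead of one run of text, so a rule that expects plain text for the image name has nothing to match.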
Correct solution:
optionorcaption : (mw_img_thumbnail (PIPE | LINK_END)) => ... | caption;
mw_img_thumbnail : {textis("thumbnail") | textis("thumb")}? mwletters;
Where 'textis' is an actual function that looks at the text of the token.
This solution uses both syntactic and semantic predicates:
Syntactic predicate: if our text is an "mwletters" that matches a semantic predicate, followed by a PIPE or LINK_END, then parse it as an mw_img_thumbnail, not a caption.
Semantic predicate: if the text of the mwletters is "thumbnail" or "thumb", then it's a valid "mw_img_thumbnail" word.
The syntactic predicate stops "...|thumbnail blah|" from parsing as a thumbnail option.
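For anyone who wants to poke at the idea outside ANTLR, here is a minimal Python sketch of the same decision. It is a model under assumptions, not the grammar on the wiki: the token names LETTERS and OTHER, the lex and classify_options functions, and the tuple representation of tokens are all invented here, and nested links are not handled. Only textis mirrors the helper named above.

```python
import re

# Generic tokens only: no token type exists for the word 'thumbnail'.
TOKEN_RE = re.compile(
    r"(?P<LINK_START>\[\[)|(?P<LINK_END>\]\])|(?P<PIPE>\|)"
    r"|(?P<LETTERS>[A-Za-z]+)|(?P<OTHER>.)",
    re.DOTALL,
)

def lex(source):
    return [(m.lastgroup, m.group()) for m in TOKEN_RE.finditer(source)]

def textis(token, word):
    """Semantic predicate helper: compare the token's text with a word."""
    return token[1] == word

def is_thumbnail_option(tokens, i):
    """Syntactic predicate: a LETTERS token that passes the semantic
    check, immediately followed by PIPE or LINK_END."""
    if i + 1 >= len(tokens):
        return False
    word, following = tokens[i], tokens[i + 1]
    return (
        word[0] == "LETTERS"
        and (textis(word, "thumbnail") or textis(word, "thumb"))
        and following[0] in ("PIPE", "LINK_END")
    )

def classify_options(source):
    """Classify each '|'-separated option after the image name."""
    tokens = lex(source)
    i = 0
    while i < len(tokens) and tokens[i][0] != "PIPE":
        i += 1  # skip the image name, whatever it contains
    results = []
    while i < len(tokens) and tokens[i][0] == "PIPE":
        i += 1
        if is_thumbnail_option(tokens, i):
            results.append(("thumbnail", tokens[i][1]))
            i += 1
        else:
            start = i
            while i < len(tokens) and tokens[i][0] not in ("PIPE", "LINK_END"):
                i += 1
            results.append(("caption", "".join(t[1] for t in tokens[start:i])))
    return results

print(classify_options("[[Image:thumbnail.jpg|thumbnail|thumbnail blah]]"))
```

In this model the file name passes through as ordinary tokens, "|thumbnail|" is taken as the option, and "|thumbnail blah" falls through to caption because the token after the word is neither PIPE nor LINK_END.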
Whee, that was a bit harder than I expected.
Anyway, the mostly-complete image parsing code is here: http://www.mediawiki.org/wiki/Markup_spec/ANTLR
It parses all image options except "page", supports images nested to any depth and tolerates links in captions.
Steve