On 11/18/07, Steve Bennett <stevagewp(a)gmail.com> wrote:
Solutions I can think of so far:
1) Explicitly make the match for text to be 'a'..'z' | 'A'..'Z' |
MW_img_thumbnail | ...
2) Make tokens for individual letters (Aa, Bb...) then make the parser
recognise a pattern like Tt + Hh + Uu + Mm...
3) Make a token which is '|thumbnail', then use some trick to
distinguish '|thumbnailblah' from '|thumbnail|'.
4) Like 1), but use a localised lexer so that those words are only
tokens in this specific context.
5) Just match text, then use special markup at the parser level to look
into the text that was matched.
Omg it's so much easier than that.
6) Use a syntactic predicate:

option
    : (magicword '|') => magicword
    | caption;
magicword
    : 'magicword';
Spoken with blissful ignorance. It turns out that that solution
doesn't work. Incorporating a literal string in the parser creates a
new lexer token, so the lexer emits that token wherever the string
appears, not just in option position. That means this type of thing
doesn't parse:
[[Image:thumbnail.jpg]].
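
A rough way to see the problem, as a toy model in Python rather than real ANTLR output. The token names (LINK_START, COLON, DOT, LETTERS, LITERAL) and the `tokenize` helper are made up for illustration; only PIPE and LINK_END appear in the actual grammar. The point is just that once a literal gets its own token type, the filename no longer lexes as ordinary letters:

```python
import re

# Hypothetical token set, for illustration only.
TOKEN_SPEC = [
    ("LINK_START", r"\[\["),
    ("LINK_END", r"\]\]"),
    ("PIPE", r"\|"),
    ("COLON", r":"),
    ("DOT", r"\."),
    ("LETTERS", r"[A-Za-z]+"),
]


def tokenize(text, literals=()):
    """Split text into (kind, text) pairs.

    Strings in `literals` model literals used in parser rules: they get
    their own token kind and are tried before the general LETTERS rule.
    """
    spec = list(TOKEN_SPEC)
    if literals:
        alternatives = "|".join(re.escape(lit) for lit in literals)
        # Only match a literal as a whole word, not as a prefix of a longer one.
        spec.insert(0, ("LITERAL", "(?:%s)(?![A-Za-z])" % alternatives))
    pattern = re.compile("|".join("(?P<%s>%s)" % pair for pair in spec))

    tokens = []
    pos = 0
    while pos < len(text):
        match = pattern.match(text, pos)
        if match is None:
            raise ValueError("unexpected character at position %d" % pos)
        tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


without_literal = tokenize("[[Image:thumbnail.jpg]]")
with_literal = tokenize("[[Image:thumbnail.jpg]]", literals=("thumbnail",))
```

In `without_literal` the word "thumbnail" is an ordinary LETTERS token, so a filename rule built from letters and dots can consume it. In `with_literal` the same word comes out as a LITERAL token, and a filename rule that expects LETTERS has nothing to match.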
Correct solution:

optionorcaption
    : (mw_img_thumbnail (PIPE | LINK_END)) => ...
    | caption;
mw_img_thumbnail
    : {textis("thumbnail") | textis("thumb")}? mwletters;

Where 'textis' is an actual function that looks at the text of the token.
This solution uses both syntactic and semantic predicates:
Syntactic predicate: if our text is an "mwletters" that matches a
semantic predicate, followed by a PIPE or LINK_END, then parse it as
an mw_img_thumbnail, not a caption.
Semantic predicate: if the text of the mwletters is "thumbnail" or
"thumb", then it's a valid "mw_img_thumbnail" word.
The syntactic predicate stops "...|thumbnail blah|" from parsing as a
thumbnail option, because there "thumbnail" is followed by more caption
text rather than a PIPE or LINK_END.
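
The way the two predicates combine can be sketched in plain Python. This is a toy model of the decision, not the generated parser: tokens are hand-written (kind, text) pairs, the LETTERS, SPACE and DOT kinds are assumptions standing in for whatever "mwletters" is built from, and this `textis` takes the token as an explicit argument, which may not match the real helper's signature.

```python
def textis(token, expected):
    """Semantic predicate: look at the token's text, not just its type."""
    kind, text = token
    return kind == "LETTERS" and text == expected


def is_thumbnail_option(tokens, i):
    """Syntactic predicate: a thumbnail word followed by PIPE or LINK_END."""
    if i + 1 >= len(tokens):
        return False
    word_ok = textis(tokens[i], "thumbnail") or textis(tokens[i], "thumb")
    follower_ok = tokens[i + 1][0] in ("PIPE", "LINK_END")
    return word_ok and follower_ok


def classify(tokens, i):
    """Choose between the two alternatives of optionorcaption."""
    return "mw_img_thumbnail" if is_thumbnail_option(tokens, i) else "caption"


# "...|thumbnail|": the word is followed directly by a PIPE.
option_tokens = [("PIPE", "|"), ("LETTERS", "thumbnail"), ("PIPE", "|")]

# "...|thumbnail blah|": the word is followed by more caption text.
caption_tokens = [
    ("PIPE", "|"),
    ("LETTERS", "thumbnail"),
    ("SPACE", " "),
    ("LETTERS", "blah"),
    ("PIPE", "|"),
]

# "...|thumb]]": the short form at the end of the link.
short_tokens = [("PIPE", "|"), ("LETTERS", "thumb"), ("LINK_END", "]]")]

option_result = classify(option_tokens, 1)
caption_result = classify(caption_tokens, 1)
short_result = classify(short_tokens, 1)
```

Because the word stays an ordinary letters token and is only reinterpreted in option position, a filename such as thumbnail.jpg is unaffected: there the word is followed by a dot, so the lookahead check fails.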
Whee, that was a bit harder than I expected.
Anyway, the mostly-complete image parsing code is here:
http://www.mediawiki.org/wiki/Markup_spec/ANTLR
It parses all image options except "page", supports arbitrarily deep
nesting of images and tolerates links in captions.
Steve