On 2/12/08, Daniel Kinzler daniel@brightbyte.de wrote:
- "parser hook" extensions (aka tag hooks aka extension tags), which conform to
a (fuzzy) xml syntax: <name foo="bar" bla=12 blubb>...</name>. The ... in between the tags should be completely opaque; the parser should skip everything up to the closing tag. There is no support for nesting, no expansion of templates or template parameters, nothing. Also, the text *returned* by the extension is expected to be HTML, and should be passed through the generation stage untouched.
The trouble there is that <ref>, for example, can contain wikitext which needs to be parsed, e.g.:
<ref>''The origin of species'', Darwin</ref>
So at a minimum I think we would need to distinguish those extensions whose internal text needs to be parsed?
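To make the distinction concrete, here's a minimal sketch of the opaque extraction step Daniel describes (the helper name extractBody is mine, not MediaWiki's actual code): the scanner grabs everything up to the closing tag verbatim, and it would then be up to the individual extension, such as <ref>, to decide whether that raw body gets fed back through the wikitext parser.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TagHook {
    // Grab the raw body of a parser-hook tag verbatim, up to its closing
    // tag, with no nesting and no template expansion. Hypothetical sketch,
    // not MediaWiki's real implementation.
    static String extractBody(String text, String tagName) {
        Pattern p = Pattern.compile(
                "<" + Pattern.quote(tagName) + "\\b[^>]*>(.*?)</" + Pattern.quote(tagName) + ">",
                Pattern.DOTALL);
        Matcher m = p.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String body = extractBody("<ref>''The origin of species'', Darwin</ref>", "ref");
        // The extracted body still contains wiki markup ('' for italics),
        // which is why <ref> must re-parse its contents while something
        // like <nowiki> would emit them untouched.
        System.out.println(body);
    }
}
```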
- "parser functions" which conform to an extended template syntax:
{{#name: param|param|param...}}; In this case, all parameters have to be fully parsed and expanded, so this needs to work: {{#foo:xx|{{#bar|{{{bla|frob}}}}}|{{something}}}}
The output of parser functions may be wikitext that has to be further processed in context (just as if it were a normal template), or it may be HTML that has to be passed through (and a few more minor options). This is determined by each extension when registering the hook.
Afaik, these are converted by the preprocessor (recently rewritten by Tim), and are completely invisible to the parser?
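A sketch of what "fully parsed and expanded" implies for parameter handling: the pipes that separate parameters can only be found after tracking brace nesting, so the preprocessor has to understand the nested structure before any expansion happens. The method name splitParams is mine, and the depth counter deliberately treats every "{{" the same way, ignoring the distinction between templates and triple-brace {{{...}}} parameters for brevity.

```java
import java.util.ArrayList;
import java.util.List;

public class ParserFunction {
    // Split the parameter list of a parser function such as
    // {{#foo:xx|{{#bar|{{{bla|frob}}}}}|{{something}}}} at top-level
    // pipes only. Simplified sketch: every "{{" bumps the nesting depth,
    // every "}}" drops it, and pipes are separators only at depth zero.
    static List<String> splitParams(String inner) {
        List<String> params = new ArrayList<>();
        int depth = 0, start = 0;
        for (int i = 0; i < inner.length(); i++) {
            if (inner.startsWith("{{", i)) { depth++; i++; }
            else if (inner.startsWith("}}", i)) { depth--; i++; }
            else if (inner.charAt(i) == '|' && depth == 0) {
                params.add(inner.substring(start, i));
                start = i + 1;
            }
        }
        params.add(inner.substring(start));
        return params;
    }
}
```

Running splitParams on "xx|{{#bar|{{{bla|frob}}}}}|{{something}}" yields three parameters, with the nested call kept intact as the second one; each would then be expanded recursively before #foo ever sees it.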
Extensions may also introduce arbitrary magic words. Such extensions are impossible to make compatible with a new ANTLR-based parser; they would have to be rewritten as plugins to such a parser. Would it be possible to allow such plugins? I'm thinking of allowing a way for extensions to redefine individual bits of the grammar.
It depends a bit on the limits of these "arbitrary magic words". I think it's actually surprisingly feasible to allow magic words that, say, consist of strings of letters surrounded by space, or certain predefined punctuation.
At first I thought that would be a nightmare, but in practice it isn't. As the second-to-last rule before rendering a string of letters literally, I would simply add a (Java/PHP) check to see if the string matched any registered extension, and parse it as an extension magic word instead. Here's how that happens with __TOC__ etc:
magic_word: UNDERSCORE UNDERSCORE magic_word_text UNDERSCORE UNDERSCORE -> ^(MAGIC_WORD magic_word_text);
magic_word_text: {is_magic_word()}? letters;
@members {
    ....
    boolean is_magic_word() {
        return input.LT(1).getText().equalsIgnoreCase("NOTOC")
            || input.LT(1).getText().equalsIgnoreCase("TOC")
            || input.LT(1).getText().equalsIgnoreCase("FORCETOC")
            || input.LT(1).getText().equalsIgnoreCase("NOGALLERY")
            || input.LT(1).getText().equalsIgnoreCase("NOEDITSECTION");
    }
}
It would only be a problem if the contents of the magic word interfered with the lexer - say, a combination of letters and other punctuation. But if the available combinations were predefined (e.g., hyphen hyphen letters digit hyphen hyphen) then they could be dealt with, and the letters themselves defined at runtime.
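That runtime definition could be as simple as swapping the hard-coded equalsIgnoreCase chain in is_magic_word() for a registry that the host application and its extensions fill in. A sketch (class and method names are mine, not an existing API):

```java
import java.util.Set;
import java.util.TreeSet;

public class MagicWordRegistry {
    // Case-insensitive set standing in for the hard-coded checks in
    // is_magic_word(); extensions would add their own words at runtime.
    private static final Set<String> WORDS =
            new TreeSet<>(String.CASE_INSENSITIVE_ORDER);
    static {
        WORDS.add("NOTOC");
        WORDS.add("TOC");
        WORDS.add("FORCETOC");
        WORDS.add("NOGALLERY");
        WORDS.add("NOEDITSECTION");
    }

    // Called by an extension when it registers a new magic word.
    static void register(String word) { WORDS.add(word); }

    // What the grammar's semantic predicate would call on the
    // upcoming token's text instead of the equalsIgnoreCase chain.
    static boolean isMagicWord(String letters) { return WORDS.contains(letters); }
}
```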
Steve