On 2/12/08, Daniel Kinzler <daniel(a)brightbyte.de> wrote:
> 1) "parser hook" extensions (aka tag hooks, aka extension tags), which conform
> to a (fuzzy) XML syntax: <name foo="bar" bla=12 blubb>...</name>. The ... in
> between the tags should be completely opaque; the parser should skip everything
> up to the closing tag. There is no support for nesting, no expansion of
> templates or template parameters, nothing. Also, the text *returned* by the
> extension is expected to be HTML, and should be passed through the generation
> stage untouched.
The trouble there is that <ref> for example can contain
wikitext...which needs to be parsed. e.g.:
<ref>''The origin of species'', Darwin</ref>
So at a minimum I think we would need to distinguish those extensions
whose internal text needs to be parsed?
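The "opaque body" behaviour described above can be sketched roughly as follows. This is an illustrative toy, not MediaWiki's actual implementation; the class and method names (`TagHookScanner`, `extractBody`) are invented for the example. It shows the key property: the scanner takes everything up to the *first* matching closing tag, so there is no nesting, and the body is returned verbatim for the extension to handle (or, per the <ref> point above, to hand back for further parsing).

```java
// Toy sketch of tag-hook scanning: skip to the first closing tag,
// no nesting, body returned verbatim. Names are hypothetical.
public class TagHookScanner {

    /**
     * Returns the raw text between an already-consumed opening tag and
     * the first matching closing tag, or null if the tag is unclosed.
     *
     * @param src     full source text
     * @param name    extension tag name, e.g. "ref"
     * @param openEnd index just past the ">" of the opening tag
     */
    public static String extractBody(String src, String name, int openEnd) {
        String close = "</" + name + ">";
        int end = src.indexOf(close, openEnd); // first close wins: no nesting
        if (end < 0) return null;              // unclosed tag
        return src.substring(openEnd, end);    // opaque: returned unparsed
    }

    public static void main(String[] args) {
        String wikitext = "<ref>''The origin of species'', Darwin</ref>";
        // openEnd = index just past "<ref>"
        String body = extractBody(wikitext, "ref", "<ref>".length());
        System.out.println(body); // ''The origin of species'', Darwin
    }
}
```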
> 2) "parser functions", which conform to an extended template syntax:
> {{#name: param|param|param...}}. In this case, all parameters have to be fully
> parsed and expanded, so this needs to work:
> {{#foo:xx|{{#bar|{{{bla|frob}}}}}|{{something}}}}
> The output of parser functions may be wikitext that has to be further processed
> in context (just as if it were a normal template), or it may be HTML that has
> to be passed through (and a few more minor options). This is determined by each
> extension when registering the hook.
Afaik, these are converted by the preprocessor (recently rewritten by
Tim), and are completely invisible to the parser?
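One concrete difficulty the nested example above illustrates: the argument separator "|" must only be honoured at the top brace-nesting level, or the inner {{#bar|...}} and {{{bla|frob}}} would be split apart. A minimal sketch of that splitting step (again a toy, not the real preprocessor; `ParamSplitter` and `splitParams` are invented names):

```java
// Toy sketch: split parser-function arguments on "|" only at brace
// depth zero, so pipes inside nested {{...}} / {{{...}}} stay intact.
import java.util.ArrayList;
import java.util.List;

public class ParamSplitter {

    public static List<String> splitParams(String args) {
        List<String> out = new ArrayList<>();
        int depth = 0, start = 0;
        for (int i = 0; i < args.length(); i++) {
            char c = args.charAt(i);
            if (c == '{') depth++;
            else if (c == '}') depth--;
            else if (c == '|' && depth == 0) { // top-level separator only
                out.add(args.substring(start, i));
                start = i + 1;
            }
        }
        out.add(args.substring(start)); // final argument
        return out;
    }

    public static void main(String[] args) {
        // the arguments of {{#foo:...}} from the example above
        System.out.println(splitParams("xx|{{#bar|{{{bla|frob}}}}}|{{something}}"));
    }
}
```

Each returned argument would then itself be recursively expanded before the parser function sees it.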
> Extensions may also introduce arbitrary magic words. Such extensions are
> impossible to make compatible with a new ANTLR-based parser; they would have to
> be rewritten as plugins to such a parser. Would it be possible to allow such
> plugins? I'm thinking of allowing a way for extensions to redefine individual
> bits of the grammar.
It depends a bit on the limits of these "arbitrary magic words". I
think it's actually surprisingly feasible to allow magic words that,
say, consist of strings of letters surrounded by space, or certain
predefined punctuation.
At first I thought that would be a nightmare, but in practice it
isn't. As the second-to-last rule before rendering a string of letters
literally, I would simply add a (Java/PHP) check to see whether the
string matches any registered extension, and parse it as an extension
magic word instead. Here's how that happens with __TOC__ etc.:
magic_word: UNDERSCORE UNDERSCORE magic_word_text UNDERSCORE UNDERSCORE
    -> ^(MAGIC_WORD magic_word_text);

magic_word_text: {is_magic_word()}? letters;

@members {
    ....
    boolean is_magic_word() {
        return
            input.LT(1).getText().equalsIgnoreCase("NOTOC") ||
            input.LT(1).getText().equalsIgnoreCase("TOC") ||
            input.LT(1).getText().equalsIgnoreCase("FORCETOC") ||
            input.LT(1).getText().equalsIgnoreCase("NOGALLERY") ||
            input.LT(1).getText().equalsIgnoreCase("NOEDITSECTION");
    }
}
It would only be a problem if the contents of the magic word
interfered with the lexer, say a combination of letters and other
punctuation. But if the available combinations were predefined (e.g.,
hyphen hyphen letters digit hyphen hyphen) then they could be dealt
with, and the letters themselves defined at runtime.
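To make the "defined at runtime" part concrete: instead of hardcoding NOTOC, TOC, etc. in the grammar predicate, the is_magic_word() helper could consult a set that extensions populate at setup time. A small sketch of that idea (the class and method names here are illustrative, not an existing MediaWiki API):

```java
// Sketch: runtime-registered magic words, so the grammar predicate
// {is_magic_word()}? checks a set instead of a hardcoded list.
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public class MagicWordRegistry {

    private static final Set<String> WORDS = new HashSet<>();

    /** Called by an extension at setup time to register its magic word. */
    public static void register(String word) {
        WORDS.add(word.toUpperCase(Locale.ROOT));
    }

    /** The grammar predicate would call this with the lexed letters. */
    public static boolean isMagicWord(String letters) {
        // Case-insensitive, matching the equalsIgnoreCase checks above.
        return WORDS.contains(letters.toUpperCase(Locale.ROOT));
    }
}
```

The is_magic_word() member above would then reduce to a single call like MagicWordRegistry.isMagicWord(input.LT(1).getText()).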
Steve