Steve Bennett wrote:
On 11/21/07, Tim Starling tstarling@wikimedia.org wrote:
<snip>
> The parser pass order has changed from
<snip>
> * Template and extension tag parse to intermediate representation
> * Template expansion and extension rendering
> * HTML normalisation and security
> * Main section...
<snip>
> The intermediate representation I have used is a DOM document tree, taking
<snip>
Uncovered main-pass syntax, such as HTML tags, is now generally valid, whereas previously in some cases it was escaped. For example, you could have "<ta" in one template, and "ble>" in another template, and put them together to make a valid <table> tag. Previously the result would have been "&lt;table&gt;".
I'm not sure I grok the impact of these changes. Say you have template A defined as follows:
''italics'' '''open-bold <b
which is called from an article with text
Crazy: {{A}}r /> stuff'''.
What text does the parser see exactly? Will it see a mixture of both rendered and un-rendered HTML? i.e.:
Crazy: <i>italics</i> '''open-bold <br /> stuff'''.
I'm guessing not, because obviously not all HTML is valid input to the parser (unlike the <i> in this case). Would you mind explaining a bit more?
Apostrophes are converted to HTML in doAllQuotes(). Invalid HTML on input is cleaned up in removeHTMLtags(). Both are now considered to be *after* the preprocessor. So in your example, the preprocessor will produce:
Crazy: ''italics'' '''open-bold <br /> stuff'''.
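To make the ordering concrete, here is a rough Python sketch of that preprocessor step. It is only a toy under stated assumptions: the TEMPLATES table and the preprocess() function are names invented for this illustration, and the real preprocessor builds a document tree rather than doing regex substitution. The point is just that the template text is spliced in raw, with quotes and HTML left for the later passes.

```python
import re

# Hypothetical template store; the contents mirror template A from the example.
TEMPLATES = {"A": "''italics'' '''open-bold <b"}

def preprocess(text, templates=TEMPLATES):
    """Toy preprocessor: splice in the raw text of each {{Name}} call.

    Apostrophes and HTML are deliberately not touched here, because in the
    scheme described above they are handled after the preprocessor.
    Unknown template names are left as they are.
    """
    def expand(match):
        return templates.get(match.group(1), match.group(0))
    return re.sub(r"\{\{(\w+)\}\}", expand, text)

expanded = preprocess("Crazy: {{A}}r /> stuff'''.")
print(expanded)
```

The two halves of the <br /> tag only meet in the output string, which is what the later passes then see as ordinary input.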
I was thinking this over and wondering whether there would be benefit in an explicit tag in templates to mark code that should be passed unrendered through to its calling page. Something like <includeraw><ta</includeraw>. But since I haven't really understood how the new PP works, this probably isn't necessary?
The only things that really need escaping from the preprocessor are the characters "{|=}", and "<" when it occurs before the name of a registered tag hook. For "|" there is the old hack {{!}}, a template which contains just "|". This takes advantage of the uncovered syntax rules in the preprocessor to hide a character from the preprocessor, passing through a literal "|" to the main pass. It's used for table syntax. This mechanism could be extended and standardised, say with a "urldecode" parser function, to put any arbitrary character into the preprocessor output.
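A small Python sketch of why the {{!}} trick works, and of what a "urldecode" style function might look like. Everything here is an assumption made for illustration: split_args() and expand_leaf() are invented names, the {{#urldecode:...}} spelling is hypothetical, and none of this is the actual preprocessor code. The idea is that arguments are split on top-level "|" before nested calls are expanded, so a "|" produced by expansion is never seen as a separator.

```python
import re
from urllib.parse import unquote

def split_args(call_body):
    """Split the inside of a {{...}} call on '|' at nesting depth 0 only."""
    parts, current, depth, i = [], [], 0, 0
    while i < len(call_body):
        two = call_body[i:i + 2]
        if two == "{{":
            depth += 1
            current.append(two)
            i += 2
        elif two == "}}":
            depth -= 1
            current.append(two)
            i += 2
        elif call_body[i] == "|" and depth == 0:
            parts.append("".join(current))
            current = []
            i += 1
        else:
            current.append(call_body[i])
            i += 1
    parts.append("".join(current))
    return parts

def expand_leaf(text):
    """Expand {{!}} and a hypothetical {{#urldecode:...}} into literal text."""
    text = text.replace("{{!}}", "|")
    return re.sub(r"\{\{#urldecode:([^{}|]*)\}\}",
                  lambda m: unquote(m.group(1)), text)

# Splitting happens first, so the pipe hidden inside {{!}} survives as data.
cells = [expand_leaf(part) for part in split_args("row|a {{!}} b|c")]
print(cells)
print(expand_leaf("{{#urldecode:%7B%7C%3D%7D}}"))
```

With percent-encoding as the escape, any of "{|=}" can be written without the preprocessor ever seeing the raw character.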
Tags such as <gallery> work by an uglier and more fragile method, i.e. with strip markers. Strip markers are placeholders passed through to the main pass, where they are hopefully not mangled too badly. They are then substituted with their rendered value, potentially destroying whatever HTML the intervening passes put in their vicinity, then finally doBlockLevels() is run, which mangles their HTML output unless the tag hook writer carefully armoured it.
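For anyone who has not met strip markers, a minimal Python sketch of the general idea follows. It is a toy under stated assumptions: the marker format, the function names and the one-rule main_pass() are all made up for illustration, and it leaves out doBlockLevels() and the mangling problems entirely.

```python
import re

def strip_tags(text, tag, render):
    """Swap each <tag>...</tag> for a placeholder and remember its rendering."""
    store = {}
    def stash(match):
        # Hypothetical marker format, chosen so normal wikitext cannot collide.
        marker = f"\x7fSTRIP-{tag}-{len(store)}\x7f"
        store[marker] = render(match.group(1))
        return marker
    pattern = re.compile(rf"<{tag}>(.*?)</{tag}>", re.DOTALL)
    return pattern.sub(stash, text), store

def main_pass(text):
    """Stand-in for the main pass: only converts ''...'' to <i>...</i>."""
    return re.sub(r"''(.+?)''", r"<i>\1</i>", text)

def unstrip(text, store):
    """Put the rendered values back where the placeholders are."""
    for marker, html in store.items():
        text = text.replace(marker, html)
    return text

source = "''See'' <gallery>''A.png''</gallery>"
stripped, store = strip_tags(
    source, "gallery", lambda body: f'<ul class="gallery">{body}</ul>')
result = unstrip(main_pass(stripped), store)
print(result)
```

The quotes inside the gallery body come out untouched because the main pass only ever sees the placeholder. The fragility described above comes from everything this sketch omits: passes that alter text around the marker, and passes that run after substitution.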
-- Tim Starling