The RFC proposal for "hygienic templates" got improved a bunch -- and renamed to "balanced templates" -- during the parsing team off-site. It seems like it's worth resending this to the list. Comments on the updated draft welcome!
----
As described in my Wikimania 2015 talk https://wikimania2015.wikimedia.org/wiki/Submissions/Templates_are_dead!_Long_live_templates! (starting at slide 27 https://wikimania2015.wikimedia.org/w/index.php?title=File:Templates_are_dead!_Long_live_templates!.pdf&page=27), there are a number of reasons to mark certain templates as "balanced". Foremost among them: to allow high-performance incremental update of page contents after templates are modified, and to allow safe editing of template uses using HTML-based tools such as Visual Editor or jsapi https://doc.wikimedia.org/Parsoid/master/#!/guide/jsapi.
This means (roughly) that the output of the template is a complete DocumentFragment https://developer.mozilla.org/en-US/docs/Web/API/DocumentFragment: every open tag is closed and there are no nodes which the HTML adoption agency algorithm http://dev.w3.org/html5/spec-LC/tree-construction.html#adoptionAgency will reorder. (More precise details below.)
Template balance is enforced: tags are closed or removed as necessary to ensure that the output satisfies the necessary constraints, regardless of the values of the template arguments or how child templates are expanded. You can imagine this as running tidy (or something like it https://phabricator.wikimedia.org/T89331) on the template output before it is inserted into the document; but see below for the actual implementation.
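As a rough illustration of the "every open tag is closed" half of that guarantee, here is a minimal Python sketch that appends close tags for anything a fragment leaves open. It is only a toy built on the standard library's `html.parser`. The names `OpenTagTracker` and `close_open_tags` are made up for this example, and the sketch does not remove stray close tags or apply the HTML5 tree-construction rules that tidy or a real tree builder would.

```python
from html.parser import HTMLParser

# Void elements never take a close tag, so they are never left "open".
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class OpenTagTracker(HTMLParser):
    """Track which elements are still open at the end of a fragment."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.stack = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; ignore a stray close tag.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def close_open_tags(fragment: str) -> str:
    """Return the fragment with close tags appended for every open element."""
    tracker = OpenTagTracker()
    tracker.feed(fragment)
    tracker.close()
    return fragment + "".join(f"</{t}>" for t in reversed(tracker.stack))
```

For example, `close_open_tags("<div><b>bold")` gives `<div><b>bold</b></div>`, while an already balanced fragment comes back unchanged.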
The primary benefit of balanced templates is allowing efficient update of articles by doing substring substitution for template bodies, without having to expand all templates to wikitext and reparse from scratch. It also guarantees that the template (and surrounding content) will be editable in Visual Editor; mistakes in template arguments won't "leak out" and prevent editing of surrounding content.
***Wikitext Syntax***
After some bikeshedding, we decided that balance should be an "opt-in" property of templates, indicated by adding a `{{#balance:TYPE}}` marker to the content. This syntax leverages the existing "parser function" syntax, and allows for different types of balance to be named where `TYPE` is.
We propose three forms of balance, of which only the first is likely to be implemented initially. Other balancing modes would provide safety in different HTML-parsing contexts. We've named two below; more might be added in the future if there is need.
1. `{{#balance:block}}` would close any open `<p>`/`<a>`/`<h*>`/`<table>` tags in the article preceding the template insertion site. In the template content all tags left open at the end will be closed, but there is no other restriction. This is similar to how block-level tags work in HTML 5. This is useful for navboxes and other "block" content.
2. `{{#balance:inline}}` would only allow inline (i.e. phrasing) content and generate an error if a `<p>`/`<a>`/`<h*>`/`<table>`/`<tr>`/`<td>`/`<th>`/`<li>` tag is seen in the content. But because of this, it *can* be used inside a block-level context without closing active `<p>`/`<a>`/`<h*>`/`<table>` in the article (as `{{#balance:block}}` would). This is useful for simple plain text templates, e.g. age calculation.
3. `{{#balance:table}}` would close `<p>`/`<a>`/`<h*>` but would allow insertion inside `<table>` and allow `<td>`/`<th>` tags in the content. (There might be some other content restrictions to prevent fostering.)
We expect `{{#balance:block}}` to be most useful for the large-ish templates whose efficient replacement would make the most impact on performance, and so we propose `{{#balance:}}` as a possible shorthand for `{{#balance:block}}`. (The current wikitext grammar does not allow `{{#balance}}`, since the trailing colon is required in parser function names, but if desired we could probably accommodate that abbreviation as well without too much pain.)
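To make the three modes easier to compare, here is one way they could be written down as data. This is a sketch under assumptions, not a spec: the `BalanceMode` class, the `MODES` table and `resolve_mode` are hypothetical names, `<h*>` is taken to mean `h1` through `h6`, and the "other content restrictions" mentioned for the table mode are left out because they are not yet defined.

```python
from dataclasses import dataclass

# Assumption: <h*> in the proposal means h1..h6.
HEADINGS = frozenset(f"h{n}" for n in range(1, 7))


@dataclass(frozen=True)
class BalanceMode:
    closes_in_context: frozenset     # tags closed in the article before the template
    forbidden_in_content: frozenset  # tags that are an error inside the template


MODES = {
    "block": BalanceMode(frozenset({"p", "a", "table"}) | HEADINGS, frozenset()),
    "inline": BalanceMode(
        frozenset(),
        frozenset({"p", "a", "table", "tr", "td", "th", "li"}) | HEADINGS,
    ),
    "table": BalanceMode(frozenset({"p", "a"}) | HEADINGS, frozenset()),
}


def resolve_mode(type_arg: str) -> BalanceMode:
    """Map the TYPE in {{#balance:TYPE}} to a mode; empty TYPE means block."""
    return MODES[type_arg.strip() or "block"]
```

With this table, `resolve_mode("")` and `resolve_mode("block")` return the same entry, which mirrors the proposed `{{#balance:}}` shorthand.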
Violations of content restrictions (i.e., a `<p>` tag in a `{{#balance:inline}}` template) would be errors, but how these errors would be conveyed is an orthogonal issue. Some options for error reporting include ugly bold text visible to readers (like `{{cite}}`), wikilint-like reports, or inclusion in `[[Category:Balance Errors]]`. Note that errors might not appear immediately: they may only occur when some other included template is edited to newly produce disallowed content, or only when certain values are passed as template arguments.
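Because a violation can come from an argument value or from a child template, a check like this has to run on the fully expanded output rather than on the template source. A minimal sketch of such a check for the inline mode follows. `ViolationFinder` and `inline_violations` are hypothetical names, and the forbidden set simply copies the tag list given above with `<h*>` read as `h1` through `h6`.

```python
from html.parser import HTMLParser

# Tag list copied from the {{#balance:inline}} description; h1..h6 is an assumption.
INLINE_FORBIDDEN = {"p", "a", "table", "tr", "td", "th", "li",
                    "h1", "h2", "h3", "h4", "h5", "h6"}


class ViolationFinder(HTMLParser):
    """Collect the names of forbidden start tags, in document order."""

    def __init__(self, forbidden):
        super().__init__()
        self.forbidden = forbidden
        self.violations = []

    def handle_starttag(self, tag, attrs):
        if tag in self.forbidden:
            self.violations.append(tag)


def inline_violations(expanded_output: str) -> list:
    """Return the forbidden tags found in already-expanded template output."""
    finder = ViolationFinder(INLINE_FORBIDDEN)
    finder.feed(expanded_output)
    finder.close()
    return finder.violations
```

How the returned list is surfaced (bold text, a lint report, a tracking category) is left open here, just as it is in the proposal.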
***Implementation***
Implementation is slightly different in the PHP parser and in Parsoid. Incremental parsing/update would necessarily not be done in the PHP parser, but it does need to enforce equivalent content model constraints for consistency.
PHP parser implementation strategy:
- When a template with `{{#balance}}` is expanded, add a marker to the start of its output.
- In the Sanitizer leave that marker alone, and then just before handing the output to tidy/depurate https://phabricator.wikimedia.org/T89331 we'll replace the marker with `</p></table>...etc...`. That pass will close the tags (and discard any irrelevant `</...>` tags). Some care is needed to ensure we discard unnecessary close tags, and not html-entity-escape them.
- PHP might not be able to implement `{{#balance:inline}}` or `{{#balance:table}}` quite yet -- there might need to be a special depurate mode, or do it in a DOM-based sanitizer, something like that. We can concentrate on `{{#balance:block}}` initially.
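The marker-then-replace idea can be sketched in a few lines. This is Python rather than PHP and is purely illustrative: the `MARKER` string, the function names, and the exact list and order of close tags are all assumptions (the proposal only says `</p></table>...etc...`), and the tidy/depurate pass that would actually consume the close tags is not modelled.

```python
# Hypothetical marker string; the real parser would use its own marker format.
MARKER = "\x7fBALANCE-block\x7f"

# Assumed close-tag list, derived from the tags {{#balance:block}} is
# described as closing; the real list and order are not specified.
BLOCK_CLOSERS = "</a>" + "".join(f"</h{n}>" for n in range(1, 7)) + "</p></table>"


def expand_template(output: str, balanced: bool) -> str:
    """Step 1: prefix the output of a balanced template with the marker."""
    return MARKER + output if balanced else output


def replace_markers(html_text: str) -> str:
    """Step 2: just before the tidy pass, swap each marker for close tags.

    The tidy pass (not modelled here) is then expected to apply the close
    tags that match something open and discard the rest.
    """
    return html_text.replace(MARKER, BLOCK_CLOSERS)
```

An unbalanced template never gets a marker, so `replace_markers` leaves its output untouched.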
In Parsoid:
- We just need to emit synthetic `</p></table></...>` tokens; the tree builder will take care of closing a tag if necessary or else discarding the token.
- When PHP switches over to a DOM-based sanitizer, it might be able to use this same strategy.
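The "close it or discard it" behaviour can be pictured with a toy stack of open elements. This is a deliberate simplification with a made-up function name: a real HTML5 tree builder applies scope rules and special cases (a stray `</p>`, for instance, is not simply dropped by the spec's in-body rules), so treat it as a picture of the idea rather than of Parsoid's tree builder.

```python
def apply_end_tokens(open_elements, end_tags):
    """Apply synthetic end tokens to a stack of open elements.

    Each end token closes the nearest matching open element, together with
    anything nested inside it, or is discarded when nothing matches.
    Returns the remaining stack and the list of discarded tokens.
    """
    stack = list(open_elements)
    discarded = []
    for tag in end_tags:
        if tag in stack:
            # Pop until the matching element itself has been removed.
            while stack.pop() != tag:
                pass
        else:
            discarded.append(tag)
    return stack, discarded
```

With `["body", "p", "b"]` open and the tokens `["p", "table"]`, the `</p>` closes both `<b>` and `<p>`, and the `</table>` is discarded because no table is open.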
***Deployment***
Unmarked templates are "unbalanced" and will render exactly the same as before; they will just be slower (require more CPU time) than balanced templates.
It is expected that we will profile the "costliest"/"most frequently used/changed" templates on Wikimedia projects and attempt to add balance markers first to those templates where the greatest potential performance gain may be achieved. Tim Starling noticed that adding a balance marker to `[[:en:Template:Infobox]]` https://en.wikipedia.org/wiki/Template:Infobox could affect over two million pages and have a large immediate effect on performance. We would want to carefully verify first that balance would not affect the appearance of any of those pages, using visual diff or other tools.
Related: T89331 https://phabricator.wikimedia.org/T89331, T114072 https://phabricator.wikimedia.org/T114072.
On Tue, Nov 10, 2015 at 1:40 PM, C. Scott Ananian cananian@wikimedia.org wrote:
- `{{#balance:inline}}` would only allow inline (i.e. phrasing) content and generate an error if a `<p>`/`<a>`/`<h*>`/`<table>`/`<tr>`/`<td>`/`<th>`/`<li>` tag is seen in the content.
Why is <a> in that list? It's not flow/block content model and filtering it out would severely restrict the usefulness of inline templates.
- In the Sanitizer leave that marker alone, and then just before handing the output to tidy/depurate https://phabricator.wikimedia.org/T89331 we'll replace the marker with `</p></table>...etc...`. That pass will close the tags (and discard any irrelevant `</...>` tags). Some care needed to ensure we discard unnecessary close tags, and not html-entity-escape them.
(...)
- We just need to emit synthetic `</p></table></...>` tokens, the tree builder will take care of closing a tag if necessary or else discarding the token.
What if there are multiple levels of unclosed tags?
On Wed, Nov 11, 2015 at 8:15 PM, Gergo Tisza gtisza@wikimedia.org wrote:
On Tue, Nov 10, 2015 at 1:40 PM, C. Scott Ananian cananian@wikimedia.org wrote:
- `{{#balance:inline}}` would only allow inline (i.e. phrasing) content and generate an error if a `<p>`/`<a>`/`<h*>`/`<table>`/`<tr>`/`<td>`/`<th>`/`<li>` tag is seen in the content.
Why is <a> in that list? It's not flow/block content model and filtering it out would severely restrict the usefulness of inline templates.
That's a good point. The problem is nested <a> tags. So we can either ban open <a> tags from the context, or ban <a> tags from the content. Or split things and have {{#balance:link}} vs {{#balance:inline}} or something like that. Feedback welcome! The details of HTML parsing are hairy, and I wouldn't be surprised if we need slight tweaks to things when we actually get to the point of implementation.
To be extra specific: if the context is:
<p>hello, there <a>friend <!-- template goes here --></p>
and the template is "foo <a>bar</a>", then HTML5 parsing will produce:
<p>hello, there <a>friend foo </a><a>bar</a></p>
where the added closing </a> tag inside the template body prevents "simple substitution" of the template contents.
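The effect can be reproduced with a toy serializer that applies just one rule: a new `<a>` start tag closes any `<a>` that is still open. This is not the HTML5 algorithm (no adoption agency, no scopes, attributes and comments are dropped), and `NoNestedAnchors` and `reserialize` are names invented for the sketch, but it is enough to show that the parsed result differs from plain substitution.

```python
from html.parser import HTMLParser


class NoNestedAnchors(HTMLParser):
    """Toy re-serializer: an <a> start tag first closes any open <a>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.stack = []

    def _close(self, tag):
        # Close elements from the top of the stack down to and including `tag`.
        while True:
            top = self.stack.pop()
            self.out.append(f"</{top}>")
            if top == tag:
                break

    def handle_starttag(self, tag, attrs):
        if tag == "a" and "a" in self.stack:
            self._close("a")
        self.stack.append(tag)
        self.out.append(f"<{tag}>")  # attributes are dropped in this toy

    def handle_endtag(self, tag):
        if tag in self.stack:
            self._close(tag)

    def handle_data(self, data):
        self.out.append(data)


def reserialize(html_text: str) -> str:
    parser = NoNestedAnchors()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)
```

Feeding it the naive substitution `<p>hello, there <a>friend foo <a>bar</a></p>` returns the string with the extra `</a>` shown above, so the template body no longer appears as a contiguous substring of the result.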
- We just need to emit synthetic `</p></table></...>` tokens, the tree builder will take care of closing a tag if necessary or else discarding the token.
What if there are multiple levels of unclosed tags?
We basically emit enough close tags to close anything which might be open, and let the tidy phase discard any which are not applicable.
Off-hand, I think the only tag where nesting would be an issue would be <table>. So I guess we'll need the Sanitizer to count the open <table> tags so we can be sure to emit enough close tags. Tricky! --scott
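Counting open tables could look roughly like the sketch below. It is a toy under assumptions: `TableDepthCounter` and `table_closers` are hypothetical names, the input is taken to be the already-sanitized HTML that precedes the template, and it is Python standing in for whatever the Sanitizer would actually do.

```python
from html.parser import HTMLParser


class TableDepthCounter(HTMLParser):
    """Count how many <table> elements are still open at the end of the input."""

    def __init__(self):
        super().__init__()
        self.depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.depth += 1

    def handle_endtag(self, tag):
        # Ignore a </table> that has no matching open table.
        if tag == "table" and self.depth > 0:
            self.depth -= 1


def table_closers(context_before_template: str) -> str:
    """Return one </table> for each table left open before the template."""
    counter = TableDepthCounter()
    counter.feed(context_before_template)
    counter.close()
    return "</table>" * counter.depth
```

For a context ending inside a table nested in another table, this yields two `</table>` tags; for a context with no open table it yields the empty string.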
In this repost I forgot to mention that the phab task for `{{#balance}}` is https://phabricator.wikimedia.org/T114445.
I copied the excellent points made by Gergo into a comment there. --scott
wikitech-l@lists.wikimedia.org