I'm in the process of rewriting the old Sanitizer::removeHTMLTags to work much better. The new code properly closes implied end-tags and obeys some additional HTML rules about what can go where.
In-progress patches posted at: http://bugzilla.wikimedia.org/show_bug.cgi?id=5497
Before I finish this up, though, it would be good if we can agree on how to handle a few things.
** HTML across template boundaries
Right now there's a big behavior difference between regular mode and Tidy mode. In regular mode, the HTML nesting and closing rules are applied separately to every transcluded text chunk. In Tidy mode, only the allowed-HTML check is applied at that stage; nesting and closing are left for Tidy to fix up at the very end.
An example of a construct that breaks is a template that defines a table header like:
<table class="fooba">
and is included like this:
{{cool-table-start}}
{{cool-row|blah}}
{{cool-table-end}}
In current non-Tidy mode, this breaks violently: the <table> gets closed in the first template, and then all the following <tr>, <td>, etc. are rejected as they're not allowed in body text.
In current Tidy mode this is allowed to pass on through just fine; the pieces are assembled and then checked for nesting later.
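To make the difference concrete, here's a toy sketch in Python (not the actual Sanitizer code). The normalize() function, its rules, and the contents of the cool-row and cool-table-end templates are made up for illustration; only the table-start markup comes from the example above.

```python
import re

TAG = re.compile(r"<(/?)([a-zA-Z]+)[^>]*>")
TABLE_ONLY = {"tr", "td", "th", "caption"}  # tags rejected outside <table>

def escape(tag_text):
    return tag_text.replace("<", "&lt;").replace(">", "&gt;")

def normalize(chunk):
    """Toy normalizer: reject misplaced tags, then close whatever is still open."""
    out, stack, pos = [], [], 0
    for m in TAG.finditer(chunk):
        out.append(chunk[pos:m.start()])
        pos = m.end()
        closing, name = m.group(1) == "/", m.group(2).lower()
        if closing:
            if name in stack:
                # Close implied end tags down to the matching open tag.
                while stack:
                    top = stack.pop()
                    out.append("</%s>" % top)
                    if top == name:
                        break
            else:
                out.append(escape(m.group(0)))  # stray close tag
        elif name in TABLE_ONLY and "table" not in stack:
            out.append(escape(m.group(0)))      # not allowed in body text
        else:
            stack.append(name)
            out.append(m.group(0))
    out.append(chunk[pos:])
    while stack:
        out.append("</%s>" % stack.pop())
    return "".join(out)

# Hypothetical template contents; only the first one is from the example.
chunks = [
    '<table class="fooba">',     # cool-table-start
    "<tr><td>blah</td></tr>",    # cool-row|blah (assumed)
    "</table>",                  # cool-table-end (assumed)
]

per_chunk = "".join(normalize(c) for c in chunks)  # roughly the non-Tidy behavior
assembled = normalize("".join(chunks))             # roughly the Tidy behavior

print(per_chunk)
print(assembled)
```

In the per-chunk case the table is closed empty and the row markup comes out escaped; in the assembled case the table survives intact.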
I really don't like this kind of construct as it makes it harder to treat transclusions at the abstract-parse-tree level in the future; in order to understand the markup _following_ the transclusion you need to have already expanded it. Yucky!
However, the current system allows the same thing to work with wiki tables (e.g. {|class="fooba") in either mode, and I'm pretty sure at least the wiki-table variants are in fairly common use on Wikipedia.
So we either need to decide to Kill Them All, or accept the sacrifice for compatibility.
** Inline HTML across wiki blocks
Currently, removeHTMLTags is applied before most other parsing steps, most notably doBlockLevels which handles paragraph splitting, wiki lists, etc.
A consequence of this is that bad nesting / illegal overlapping can occur with a construct like this:
<b>First paragraph

Second paragraph
The HTML normalizer adds the missing close tag:
<b>First paragraph

Second paragraph</b>
and later the wiki block-level pass adds <p> tags:
<p><b>First paragraph </p><p>Second paragraph</b> </p>
This is fairly obviously incorrect. It _probably_ would make sense to rework how the block-level pass interacts with the HTML normalization, so that it runs either before it or in concert with it.
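A rough Python sketch of the ordering problem, under heavy simplification: close_open_tags() and wrap_paragraphs() are hypothetical stand-ins for the normalizer and doBlockLevels, not the real code, and they only understand <b> and blank-line paragraph breaks.

```python
import re

def close_open_tags(text, tag="b"):
    """Toy normalizer step: append close tags for any unclosed <b>."""
    missing = text.count("<%s>" % tag) - text.count("</%s>" % tag)
    return text + ("</%s>" % tag) * max(missing, 0)

def split_paragraphs(text):
    """Blank lines separate paragraphs."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

def wrap_paragraphs(text):
    """Toy stand-in for the block-level pass."""
    return "".join("<p>%s</p>" % p for p in split_paragraphs(text))

def is_well_nested(html):
    """Check that every close tag matches the most recently opened tag."""
    stack = []
    for closing, name in re.findall(r"<(/?)([a-z]+)>", html):
        if not closing:
            stack.append(name)
        elif not stack or stack.pop() != name:
            return False
    return not stack

source = "<b>First paragraph\n\nSecond paragraph"

# Current order: normalize the whole text, then split into paragraphs.
current = wrap_paragraphs(close_open_tags(source))

# One possible alternative: split first, then normalize each paragraph.
reordered = "".join(
    "<p>%s</p>" % close_open_tags(p) for p in split_paragraphs(source)
)

print(current, is_well_nested(current))
print(reordered, is_well_nested(reordered))
```

Note that the reordered variant is well nested but no longer carries the bold into the second paragraph; whether that's the desired rendering is a separate question this sketch doesn't answer.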
** Mixing of HTML and wiki tables
Running tests on pages from French Wikipedia, I found a cute bugger that does something like this:
{|
<caption>A table caption</caption>
|-
|blah
|}
Since tables haven't been replaced in the output yet, this <caption> is in a <body> context as far as the HTML normalizer can see, so it is rejected. But the old code let it through, in both Tidy and non-Tidy mode.
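The two possible answers can be sketched as a single flag in a toy context check. This is Python for illustration only; caption_allowed() and its parameter are hypothetical, and the check only knows about wiki table delimiters and <caption>.

```python
import re

# Wiki table open/close markers at the start of a line, or an HTML <caption>.
TOKEN = re.compile(r"^\{\||^\|\}|<caption>", re.MULTILINE)

def caption_allowed(wikitext, wiki_tables_count_as_tables):
    """Return True if every <caption> sits in what we treat as a table context."""
    depth = 0
    for m in TOKEN.finditer(wikitext):
        tok = m.group(0)
        if tok == "{|":
            depth += 1
        elif tok == "|}":
            depth = max(depth - 1, 0)
        elif depth == 0 or not wiki_tables_count_as_tables:
            # <caption> in body context as far as this check can tell
            return False
    return True

text = "{|\n<caption>A table caption</caption>\n|-\n|blah\n|}"

print(caption_allowed(text, wiki_tables_count_as_tables=True))
print(caption_allowed(text, wiki_tables_count_as_tables=False))
```

With the flag on, wiki table syntax is treated as shorthand for HTML table tags and the caption passes; with it off, the two are separate entities and the caption is rejected, which is what the new normalizer currently does.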
While this kind of admixture looks *supremely ugly* to me, do we have any reason to disallow it?
Should we think of the wiki table syntax as just a shortcut/transformation to HTML table tags, or should they be entirely separate entities?
-- brion vibber (brion @ pobox.com)