Subramanya Sastry wrote:
* Unclosed HTML tags (very common)
* Misnested tags
* Misnesting of tags (ex: links in links .. [http://foo.bar this is a [[foobar]] company])
* Fostered content in tables
(<table>this-content-will-show-up-outside-the-table<tr><td>....</td></tr></table>)
... this has been one of the biggest sources of complexity inside Parsoid
... in combination with templates, this is nasty.
* Other ways in which the HTML5 content model might be violated (ex: <small>\n*a\n*b\n</small>)
* Look at the parser tests file and see all the tests we've added with
annotations that say "php parser relies on tidy"
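
To make the first few categories in the list above concrete, here is a minimal sketch in Python of a checker that flags unclosed tags, misnested tags, and text sitting directly inside a table. It is only an illustration built on the standard library's html.parser; it is not how Tidy, Parsoid, or an HTML5 tree builder works, and the names TagAudit and audit are hypothetical. It also ignores HTML5's optional end tags, so a bare <p> is reported as unclosed even though HTML5 allows it.

```python
from html.parser import HTMLParser

# Void elements never take a closing tag, so they are not tracked.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}

# Text directly inside these elements is what an HTML5 parser would
# "foster-parent" out of the table.
TABLE_CONTEXT = {"table", "thead", "tbody", "tfoot", "tr"}


class TagAudit(HTMLParser):
    """Hypothetical helper: records tag-balance problems while parsing."""

    def __init__(self):
        super().__init__()
        self.stack = []      # currently open (non-void) elements
        self.problems = []   # (kind, detail) tuples, in document order

    def handle_starttag(self, tag, attrs):
        if tag not in VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID:
            return
        if tag not in self.stack:
            self.problems.append(("stray-close", tag))
            return
        # Anything opened after `tag` and still open is misnested.
        while self.stack[-1] != tag:
            self.problems.append(("misnested", self.stack.pop()))
        self.stack.pop()

    def handle_data(self, data):
        if data.strip() and self.stack and self.stack[-1] in TABLE_CONTEXT:
            self.problems.append(("fostered-text", data.strip()))


def audit(html):
    """Return the problems found, with still-open tags listed last."""
    parser = TagAudit()
    parser.feed(html)
    parser.close()
    return parser.problems + [("unclosed", tag) for tag in parser.stack]


if __name__ == "__main__":
    print(audit("<table>oops<tr><td>x</td></tr></table>"))
    print(audit("<b><i>x</b></i>"))
    print(audit("<div><span>hi"))
```

A checker like this only reports problems; deciding how to repair them (and matching what existing pages rely on) is the hard part the list is describing.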
I don't see why we would want to incur the maintenance cost of continuing
to support any of these bad inputs. I think we should look to deprecate,
not replace, Tidy. This is a case of the cure being worse than the disease.
So, you cannot just rip out Tidy without putting something in its
place. Even replacing it with an HTML5 parser (as per the current
plan) is not entirely straightforward, simply because of all the other
unrelated-to-html5-semantics behavior. Part of the task of replacing
Tidy is to figure out all the ways those pages might break and the best
way to handle that breakage.
We shouldn't rip out Tidy immediately. Instead, we should implement a means
of disabling Tidy on a per-page or per-user basis and allow the wiki process
to correct bad markup over time. Cunningham's Law applies here.
MZMcBride
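
As a rough illustration of the per-page or per-user opt-out proposed above, the decision could be sketched as follows. Everything here is an assumption made for the example: RenderContext, its three flags, and should_run_tidy are hypothetical names and do not correspond to any existing MediaWiki setting or API.

```python
from dataclasses import dataclass


@dataclass
class RenderContext:
    """Hypothetical inputs to the decision; none of these are real settings."""
    site_tidy_enabled: bool = True   # assumed site-wide default
    page_opted_out: bool = False     # assumed per-page flag
    user_opted_out: bool = False     # assumed per-user preference


def should_run_tidy(ctx: RenderContext) -> bool:
    """Run Tidy only if the site enables it and neither page nor user opts out."""
    if not ctx.site_tidy_enabled:
        return False
    return not (ctx.page_opted_out or ctx.user_opted_out)


if __name__ == "__main__":
    # An editor who has opted out sees the raw, un-tidied output and can fix it.
    print(should_run_tidy(RenderContext(user_opted_out=True)))
```

With a switch like this, editors who opt out would see broken markup as-is and could correct it in the wikitext, while everyone else keeps the cleaned-up rendering.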