Subramanya Sastry wrote:
- Unclosed HTML tags (very common)
- Misnested tags
- Misnesting of tags (ex: links in links:
  [http://foo.bar this is a [[foobar]] company])
- Fostered content in tables
  (<table>this-content-will-show-up-outside-the-table<tr><td>....</td></tr></table>).
  This has been one of the biggest sources of complexity inside Parsoid; in combination with templates, this is nasty.
- Other ways in which the HTML5 content model might be violated (ex: <small>\n*a\n*b\n</small>)
- Look at the parser tests file and see all the tests we've added with annotations that say "php parser relies on tidy"
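To make the first two categories in the list above concrete, here is a rough sketch of how unclosed and misnested tags could be flagged with a simple tag stack. It is only an illustration, not how Tidy or Parsoid work: `TagChecker` and `check` are hypothetical names, it uses Python's standard `html.parser`, and it does not model HTML5 tree building, so it will not notice fostered table content or other content-model violations.

```python
from html.parser import HTMLParser

# Void elements never take a closing tag, so they are not tracked on the stack.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class TagChecker(HTMLParser):
    """Hypothetical helper: records stray, misnested, and unclosed tags."""

    def __init__(self):
        super().__init__()
        self.stack = []      # currently open tags, innermost last
        self.problems = []   # (kind, tag) tuples in order of discovery

    def handle_starttag(self, tag, attrs):
        if tag not in VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID:
            return
        if tag not in self.stack:
            # A close tag with no matching open tag.
            self.problems.append(("stray-close", tag))
            return
        # Anything opened after `tag` and still open is misnested.
        while self.stack[-1] != tag:
            self.problems.append(("misnested", self.stack.pop()))
        self.stack.pop()


def check(markup):
    """Return a list of (kind, tag) problems found in `markup`."""
    checker = TagChecker()
    checker.feed(markup)
    checker.close()
    # Whatever is still open at end of input was never closed.
    unclosed = [("unclosed", tag) for tag in reversed(checker.stack)]
    return checker.problems + unclosed
```

For example, `check("<div><p>text")` reports both `p` and `div` as unclosed, while `check("<table>text<tr><td>x</td></tr></table>")` reports nothing, which is exactly why the fostered-content case needs a real HTML5 tree builder rather than a tag-balance check.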
I don't see why we would want to incur the maintenance cost of continuing to support any of these bad inputs. I think we should look to deprecate, not replace, Tidy. This is a case of the cure being worse than the disease.
So, you cannot just rip out Tidy and not replace it with something in its place. Even replacing it with an HTML5 parser (as per the current plan) is not entirely straightforward simply because of all the other unrelated-to-html5-semantics behavior. Part of the task of replacing Tidy is to figure out all the ways those pages might break and the best way to handle that breakage.
We shouldn't rip out Tidy immediately. Instead, we should implement a means of disabling Tidy on a per-page or per-user basis and allow the wiki process to correct bad markup over time. Cunningham's Law applies here.
MZMcBride