On 08/18/2015 07:58 AM, MZMcBride wrote:
Subramanya Sastry wrote:
- Unclosed HTML tags (very common)
- Misnested tags
- Misnesting of tags (ex: links in links .. [http://foo.bar this is a
[[foobar]] company])
- Fostered content in tables
(<table>this-content-will-show-up-outside-the-table<tr><td>....
</td></tr></table>) ... this has been one of the biggest sources of complexity inside Parsoid ... in combination with templates, this is nasty.
- Other ways in which the HTML5 content model might be violated. (ex: <small>\n*a\n*b\n</small>)
- Look at the parser tests file and see all the tests we've added with annotations that say "php parser relies on tidy"
I don't see why we would want to incur the maintenance cost of continuing to support any of these bad inputs. I think we should look to deprecate, not replace, Tidy. This is a case of the cure being worse than the disease.
Are you suggesting that you get rid of wikitext editing? If not, you cannot assume editors are going to write perfect markup.
What is needed is a way to define DOM scopes in wikitext and enforce well-formedness within scopes. So, for example, template output can be considered a DOM scope (either opt-in or opt-out). If we felt bold, we could define a list to be a DOM scope .. or a table to be a DOM scope ... or an image caption to be a DOM scope, and so on.
Rather than expect editors to write perfect markup, we should be thinking about sane semantics for them, like scoping, that limit the effects of broken markup. With proper semantics, it is easier to reason about markup and not rely on the whimsical behavior of whatever tool we used yesterday, use today, or might use tomorrow.
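To make the scoping idea concrete, here is a rough sketch in Python. It is not what Parsoid or the PHP parser does, and it is not the HTML5 tree-building algorithm (no foster parenting, no adoption agency). It only assumes that a "DOM scope" is a fragment whose open tags must be closed before the fragment ends. The names ScopeBalancer and balance_scope are made up for illustration.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class ScopeBalancer(HTMLParser):
    """Re-serialize a fragment so every tag opened in it is closed in it."""

    def __init__(self):
        # Keep character references as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.out = []    # serialized output pieces
        self.stack = []  # names of currently open elements

    def handle_starttag(self, tag, attrs):
        self.out.append(self.get_starttag_text())
        if tag not in VOID:
            self.stack.append(tag)

    def handle_startendtag(self, tag, attrs):
        # Self-closing syntax such as <br/>: emit as written, nothing to track.
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag not in self.stack:
            return  # stray close tag with no matching open tag: drop it
        # Close everything opened after the matching tag, then the tag itself.
        while self.stack:
            top = self.stack.pop()
            self.out.append(f"</{top}>")
            if top == tag:
                break

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def finish(self):
        self.close()
        # Anything still open at the end of the scope is closed here,
        # so it cannot leak into the surrounding content.
        while self.stack:
            self.out.append(f"</{self.stack.pop()}>")
        return "".join(self.out)


def balance_scope(fragment):
    """Hypothetical helper: return `fragment` with its tags balanced."""
    parser = ScopeBalancer()
    parser.feed(fragment)
    return parser.finish()
```

With this, an unclosed <small> in (say) template output stays inside that output: balance_scope("<small>unclosed") + " after" gives "<small>unclosed</small> after", so the text that follows the scope is unaffected.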
We are working towards this kind of scoping semantics, and the first step on the way is to get an HTML5 treebuilder / parser in place.
Subbu.