On 08/17/2015 10:15 PM, MZMcBride wrote:
Failing fast and loud is good in lots of contexts. I don't think wiki editing is one of them.
The only cited example of real breakage so far has been mismatched <div>s. How often are you or anyone else adding <div>s to pages? In my experience, most users rely on MediaWiki templates for any kind of complex markup.
Echoing my initial reply in this thread, I still don't really understand what behaviors from Tidy we want to keep. I've been following https://phabricator.wikimedia.org/T89331 a bit and it also hasn't helped answer this question.
Wikitext is string-based and generates an HTML string that, in the general case, need not be well-formed HTML. There is a lot of broken wikitext out there, and if you remove Tidy and don't introduce an HTML5-parser-based balancer, you are going to see a lot of breakage:
* Unclosed HTML tags (very common)
* Misnested tags
* Misnesting of tags (ex: links in links .. [http://foo.bar this is a [[foobar]] company])
* Fostered content in tables (<table>this-content-will-show-up-outside-the-table<tr><td>....</td></tr></table>) ... this has been one of the biggest sources of complexity inside Parsoid ... in combination with templates, this is nasty.
* Other ways in which the HTML5 content model might be violated (ex: <small>\n*a\n*b\n</small>)
* Look at the parser tests file and see all the tests we've added with annotations that say "php parser relies on tidy"
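To make the first two items concrete, here is a minimal sketch of what "balancing" means for unclosed and misnested tags. It is built on Python's standard `html.parser` and is deliberately naive: it is not Tidy, not Parsoid, and not an HTML5 treebuilder (it does no foster-parenting and does not reopen formatting elements the way the HTML5 algorithm does). The names `NaiveBalancer`, `balance`, and `VOID_TAGS` are made up for this illustration.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag (partial list, enough for the sketch).
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class NaiveBalancer(HTMLParser):
    """Hypothetical helper: re-emits markup with every open tag closed."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.out = []        # output fragments, joined at the end
        self.open_tags = []  # stack of currently open element names

    def handle_starttag(self, tag, attrs):
        self.out.append(self.get_starttag_text())
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_startendtag(self, tag, attrs):
        # Self-closing syntax such as <br/>: emit as-is, nothing to track.
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag not in self.open_tags:
            return  # stray closing tag: drop it
        # Close everything that was opened after the matching start tag.
        while self.open_tags:
            top = self.open_tags.pop()
            self.out.append(f"</{top}>")
            if top == tag:
                break

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def close(self):
        super().close()
        # Anything still open at end of input gets closed, innermost first.
        while self.open_tags:
            self.out.append(f"</{self.open_tags.pop()}>")


def balance(html):
    parser = NaiveBalancer()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

With this sketch, `balance("<div><b>bold")` returns `<div><b>bold</b></div>`, and `balance("<b><i>x</b>y</i>")` returns `<b><i>x</i></b>y`. An HTML5 parser would treat the second case differently (it would keep `y` italic), which is one reason a real treebuilder is needed rather than a stack-based cleanup like this one.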
[[ Tangent: We have a linting option in Parsoid that we can turn on in production that can dump information about all these broken forms of wikitext (we have this information because we have to break the wikitext in the same ways when we convert html to wikitext). We haven't turned it on in production yet because we haven't yet had the time to hook this into project wikicheck .. we had initial conversations, but we couldn't follow up on our end. ]]
Besides these, there is also other unrelated-to-HTML5-semantics behavior that wikis have come to rely on.
* Stripping of empty tags -- correct page rendering relies on the fact that Tidy strips empty elements from HTML. We had to explicitly add this behavior to Parsoid so pages render identically. We could rip this out as long as all those templates are fixed up. The infobox on itwiki:Luna relies on this, to give you a specific example.
* Some behaviors found in https://phabricator.wikimedia.org/T4542
* I am sure there are a bunch of other behaviors that I am missing / don't know about.
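As a rough illustration of the empty-tag stripping described above, the following sketch removes elements that have no content, repeating until nothing changes so that elements which become empty are removed too. It is an assumption-laden toy: which elements Tidy actually drops, and under what conditions, is not spelled out in this message, so the sketch treats every element the same and uses a regular expression rather than a parser. `EMPTY_ELEMENT` and `strip_empty_elements` are hypothetical names.

```python
import re

# Matches <tag ...> followed only by whitespace and then </tag>.
# Assumption: attributes contain no "<" or ">" characters.
EMPTY_ELEMENT = re.compile(
    r"<([a-z][a-z0-9]*)(?:\s[^<>]*)?>\s*</\1\s*>",
    re.IGNORECASE,
)


def strip_empty_elements(html):
    """Repeatedly drop empty elements until the markup stops changing."""
    previous = None
    while previous != html:
        previous = html
        html = EMPTY_ELEMENT.sub("", html)
    return html
```

For example, `strip_empty_elements("<div><span></span></div><p>x</p>")` returns `<p>x</p>`: the empty `<span>` goes first, which leaves the `<div>` empty, so it goes on the next pass. A template that emits a wrapper around an optional, possibly empty value is the kind of case where page rendering ends up depending on this behavior.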
So, you cannot just rip out Tidy without putting something in its place. Even replacing it with an HTML5 parser (as per the current plan) is not entirely straightforward, simply because of all the other unrelated-to-HTML5-semantics behavior. Part of the task of replacing Tidy is to figure out all the ways those pages might break and the best way to handle that breakage.
Going forward, we are thinking about how to enforce stricter constraints on what templates (and extensions) can produce, so that the impact of broken wikitext is contained. That will give you some of what you are asking for ("fail fast", but in a different form). That requires a functioning HTML5 treebuilder / parser to be in place, which is what this RFC is about.
Subbu.