On 08/17/2015 10:15 PM, MZMcBride wrote:
Failing fast and loud is good in lots of contexts. I don't think wiki
editing is one of them.
The only cited example of real breakage so far has been mismatched <div>s.
How often are you or anyone else adding <div>s to pages? In my experience,
most users rely on MediaWiki templates for any kind of complex markup.
Echoing my initial reply in this thread, I still don't really understand
what behaviors from Tidy we want to keep. I've been following
<https://phabricator.wikimedia.org/T89331> a bit and it also hasn't helped
answer this question.
Wikitext is string-based and generates an HTML string, and in the general
case that string need not be well-formed HTML. There is a lot of broken
wikitext out there, and if you remove Tidy without introducing an
HTML5-parser-based balancer, you are going to see a lot of breakage:
* Unclosed HTML tags (very common)
* Misnested tags
* Misnesting of tags (ex: links in links .. [http://foo.bar this is a
[[foobar]] company])
* Fostered content in tables
(<table>this-content-will-show-up-outside-the-table<tr><td>....</td></tr></table>)
... this has been one of the biggest sources of complexity inside Parsoid
... in combination with templates, this is nasty.
* Other ways in which the HTML5 content model might be violated (ex:
<small>\n*a\n*b\n</small>)
* Look at the parser tests file and see all the tests we've added with
annotations that say "php parser relies on tidy"
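
To make the first two bullets concrete, here is a rough sketch of a
stack-based check for unclosed and misnested tags. It is an illustration
only, written against Python's standard html.parser tokenizer. It is not
what Tidy or Parsoid does, and it is not the HTML5 tree-construction
algorithm. The names TagLinter, lint and VOID are made up for this example.

```python
from html.parser import HTMLParser

# Void elements never take an end tag, so they are never pushed on the stack.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class TagLinter(HTMLParser):
    """Hypothetical helper: tracks open tags on a stack and records problems."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.stack = []
        self.problems = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID:
            return
        if tag not in self.stack:
            # End tag with no matching open tag.
            self.problems.append(("stray-end", tag))
            return
        # Anything opened after the matching start tag was never closed
        # explicitly; a balancer would have to close it here.
        while self.stack[-1] != tag:
            self.problems.append(("implicitly-closed", self.stack.pop()))
        self.stack.pop()


def lint(html):
    """Return a list of (problem, tag) pairs for the given HTML string."""
    linter = TagLinter()
    linter.feed(html)
    linter.close()
    problems = list(linter.problems)
    # Whatever is still open at end of input is unclosed.
    problems.extend(("unclosed", tag) for tag in reversed(linter.stack))
    return problems


print(lint("<div><span>text"))
print(lint("<b><i>x</b></i>"))
```

Note that the fostered-content example above passes this check, since its
tags are perfectly balanced. Catching that case needs a real treebuilder
that knows the HTML5 content model. The sketch also flags things that HTML5
allows, such as <li> items without end tags, so treat it as a toy.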
[[ Tangent: We have a linting option in Parsoid that we can turn on in
production that can dump information about all these broken forms of
wikitext (we have this information because we have to break the wikitext
in the same ways when we convert html to wikitext). We haven't turned it
on in production yet because we haven't yet had the time to hook this
into project wikicheck .. we had initial conversations, but we couldn't
follow up on our end. ]]
Besides these, there is also other unrelated-to-html5-semantics behavior
that wikis have come to rely on.
* Stripping of empty tags -- correct page rendering relies on the fact
that Tidy strips empty elements from HTML. We had to explicitly add this
behavior to Parsoid so pages render identically. We could rip this out
as long as all those templates are fixed up. The infobox on itwiki:Luna
relies on this, to give you a specific example.
* Some behaviors found in https://phabricator.wikimedia.org/T4542
* I am sure there are a bunch of other behaviors that I am missing /
don't know about.
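
As an illustration of the empty-tag stripping mentioned above, here is a
minimal sketch of that kind of pass over an already well-formed fragment.
The exact rules Tidy applies are not spelled out in this thread, so the
rule used here (drop elements with no attributes, no children and only
whitespace text, except for a small KEEP set) is an assumption, and
strip_empty, strip_empty_elements and KEEP are hypothetical names.

```python
import xml.etree.ElementTree as ET

# Assumption: elements that are meaningful even when empty are left alone.
KEEP = {"br", "hr", "img"}


def strip_empty(parent):
    """Recursively remove empty child elements, preserving surrounding text."""
    prev = None  # last sibling that was kept
    for child in list(parent):
        strip_empty(child)  # children first, so emptiness propagates upwards
        is_empty = (
            len(child) == 0
            and not (child.text or "").strip()
            and not child.attrib
            and child.tag not in KEEP
        )
        if is_empty:
            # Text following the removed element must not be lost:
            # attach it to the previous sibling, or to the parent's text.
            tail = child.tail or ""
            if prev is None:
                parent.text = (parent.text or "") + tail
            else:
                prev.tail = (prev.tail or "") + tail
            parent.remove(child)
        else:
            prev = child


def strip_empty_elements(markup):
    """Parse a well-formed fragment, strip empty elements, and serialize it."""
    root = ET.fromstring(markup)
    strip_empty(root)
    return ET.tostring(root, encoding="unicode")


print(strip_empty_elements("<div><span></span>text<b><i></i></b><p>kept</p></div>"))
# prints: <div>text<p>kept</p></div>
```

A template that emits a wrapper element around an optional, often empty
parameter is the kind of case where output like this differs visibly from
the unstripped version.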
So, you cannot just rip out Tidy without putting something in
its place. Even replacing it with an HTML5 parser (as per the current
plan) is not entirely straightforward simply because of all the other
unrelated-to-html5-semantics behavior. Part of the task of replacing
Tidy is to figure out all the ways those pages might break and the best
way to handle that breakage.
Going forward, we are thinking about how to enforce stricter constraints
on what templates (and extensions) can produce so that the impact of broken
wikitext is contained. That will give you some of what you are asking for
("fail fast", but in a different form). That requires a functioning
HTML5 treebuilder / parser to be in place, which is what this RFC is about.
Subbu.