I'm elevating this task of mine to RFC status:
https://phabricator.wikimedia.org/T89331
Running the output of the MediaWiki parser through HTML Tidy always seemed like a nasty hack. The effects on wikitext syntax are arbitrary and change from version to version. When we upgrade our Linux distribution, we sometimes see changes in the HTML generated by given wikitext, which is not ideal.
Parsoid took a different approach. After token-level transformations, tokens are fed into the HTML 5 parse algorithm, a complex but well-specified algorithm which generates a DOM tree from quirky input text.
http://www.w3.org/TR/html5/syntax.html
We can get nearly the same effect in MediaWiki by replacing the Tidy transformation stage with an HTML 5 parse followed by serialization of the DOM back to HTML. This would stabilize wikitext syntax and resolve several important syntax differences compared to Parsoid.
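For illustration, the proposed round-trip can be sketched in a few lines of JavaScript. To be clear, this is not the HTML5 algorithm (there is no foster parenting and no adoption agency logic); it is only a naive stack-based stand-in for the real tree builder, showing the parse-then-serialize idea:

```javascript
// Toy sketch only: a naive stack-based recovery pass standing in for a
// real HTML5 tree builder. It closes unclosed tags and drops stray end
// tags, then serializes the result back to HTML.
function rebalance(html) {
  const VOID = new Set(['br', 'hr', 'img', 'meta', 'link', 'input']);
  const out = [];
  const stack = [];
  // crude tokenizer: split into tags and text (good enough for a demo)
  for (const tok of html.split(/(<[^>]+>)/).filter(Boolean)) {
    const m = tok.match(/^<(\/?)([a-zA-Z0-9]+)([^>]*)>$/);
    if (!m) { out.push(tok); continue; }       // text node
    const [, slash, tag, attrs] = m;
    if (!slash) {                              // start tag
      out.push(`<${tag}${attrs}>`);
      if (!VOID.has(tag)) stack.push(tag);
    } else if (stack.includes(tag)) {          // matched end tag:
      let open;                                // also close anything
      do {                                     // opened after it
        open = stack.pop();
        out.push(`</${open}>`);
      } while (open !== tag);
    }                                          // stray end tag: dropped
  }
  while (stack.length) out.push(`</${stack.pop()}>`); // close at EOF
  return out.join('');
}

console.log(rebalance('<div><b>bold <i>both</b> text'));
// → <div><b>bold <i>both</i></b> text</div>
```

The real algorithm makes far subtler decisions (formatting-element reconstruction, table fostering), which is exactly why reimplementing it in PHP is unattractive.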
However:
* I have not been able to find any PHP implementation of this algorithm. Masterminds and Ressio do not even attempt it. Electrolinux attempts it but does not implement the error recovery parts that are of interest to us.
* Writing our own would be difficult.
* Even if we did write it, it would probably be too slow.
So the question is: what language should we use? Since this is the standard programmer troll question, please bring popcorn.
The best implementation of this algorithm is in Java: the validator.nu parser is maintained by Mozilla, and has source translation to C++, which is used by Mozilla and could potentially be used for an HHVM extension.
There is also a Rust port (also written by Mozilla), and notable implementations in JavaScript and Python.
For WMF, a Java service would be quite easily done, and I have prototyped it already. An HHVM extension might also be possible. A non-service fallback for small installations might be Node.js or a compiled binary from Rust or C++.
-- Tim Starling
Is it possible to use part of the Parsoid code to do this?
- Trevor
On Tuesday, August 11, 2015, Tim Starling <tstarling@wikimedia.org> wrote:
[...]
Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
On Tue, Aug 11, 2015 at 5:16 PM, Trevor Parscal tparscal@wikimedia.org wrote:
Is it possible to use part of the Parsoid code to do this?
It is possible to do this in Parsoid (or any node service) with this line:
var sanerHTML = domino.createDocument(input).outerHTML;
However, performance is about 2x worse than the current Tidy pass (116 ms for Tidy vs. 238 ms for domino on the Obama article), and about 4x slower than the fastest option in our tests. The task has a lot more benchmarks of various options.
Gabriel
Interesting. What is the cause of the slower speed?
- Trevor
On Tuesday, August 11, 2015, Gabriel Wicke gwicke@wikimedia.org wrote:
[...]
On Tue, Aug 11, 2015 at 5:24 PM, Trevor Parscal tparscal@wikimedia.org wrote:
Interesting. What is the cause of the slower speed?
Mainly a pure-JS DOM implementation (domino) not being quite the same speed as C or Rust with all optimizations turned on. The deltas are roughly in line with language benchmarks like http://benchmarksgame.alioth.debian.org/.
Gabriel
Language choice. Tidy is written in C. Note that I included shelling out to Node.js as an option in my original post. It's not really part of Parsoid; it's a JavaScript library that Parsoid uses. We would use the same JavaScript library with a few lines of wrapper code.
-- Tim Starling
On 12/08/15 10:24, Trevor Parscal wrote:
Interesting. What is the cause of the slower speed?
[...]
Tim Starling wrote:
https://phabricator.wikimedia.org/T89331
Running the output of the MediaWiki parser through HTML Tidy always seemed like a nasty hack. The effects on wikitext syntax are arbitrary and change from version to version. When we upgrade our Linux distribution, we sometimes see changes in the HTML generated by given wikitext, which is not ideal.
[...]
We can get nearly the same effect in MediaWiki by replacing the Tidy transformation stage with an HTML 5 parse followed by serialization of the DOM back to HTML. This would stabilize wikitext syntax and resolve several important syntax differences compared to Parsoid.
Related tasks:
* https://phabricator.wikimedia.org/T4542
* https://phabricator.wikimedia.org/T56617
It's not clear to me which behaviors from Tidy we want to keep. Looking at the various bugs that Tidy has caused, it's apparent that there are a number of behaviors we want to disable/avoid.
My understanding is that Tidy is not responsible for output sanitization and it's not responsible for preprocessing or parsing. MediaWiki handles all of that elsewhere. If Tidy is only needed for mismatched HTML elements, we could possibly catch and disallow or gracefully handle that specific use-case in MediaWiki. What other beneficial behavior of Tidy would we need to replicate?
Or could we replace Tidy with nothing? Relying on the principle of "garbage in, garbage out" seems reasonable in some ways. And modern browsers are fairly adept at handling moderately bad HTML.
MZMcBride
On 8/12/15, MZMcBride z@mzmcbride.com wrote:
[...]
The main thing Tidy does (imo) is ensure that mismatched HTML failures stay localized. When somebody makes a mistake, it can cause the entire skin to go whacko. We ideally want markup mistakes to affect only the user-generated content (and preferably, only the area around the mistake).
--bawolff
Some years back I was importing a large number of complex templates to a wiki that didn't have tidy enabled. The results were nothing short of horrendous in a substantial number of cases. Wiki authors will generally stop worrying about their code as long as the results look right. For good or ill, tidy does a remarkable job of localizing unclosed tags, and often that is enough to effectively fix the appearance of broken HTML syntax so it doesn't spill over into other sections. Without Tidy (or its equivalent) there will be a lot of template garbage that needs to be repaired.
The garbage in -> garbage out approach might seem appealing in principle, but any transition to such a condition is going to dredge up a lot of malformed HTML code created by wiki editors that we've been hiding for many years. If one is going to replace Tidy with something substantially different in execution, I would suggest that one needs a significant test suite of complex pages in order to judge how bad the collateral damage is likely to be, and ideally some set of tools to help editors fix it.
-Robert Rohde
On Thu, Aug 13, 2015 at 7:51 AM, Brian Wolff bawolff@gmail.com wrote:
[...]
Robert Rohde wrote:
Some years back I was importing a large number of complex templates to a wiki that didn't have tidy enabled. The results were nothing short of horrendous in a substantial number of cases. Wiki authors will generally stop worrying about their code as long as the results look right. For good or ill, tidy does a remarkable job of localizing unclosed tags, and often that is enough to effectively fix the appearance of broken HTML syntax so it doesn't spill over into other sections. Without Tidy (or its equivalent) there will be a lot of template garbage that needs to be repaired.
As we get saner input mechanisms (CodeEditor, VisualEditor, ScoreEditor, etc.), we'll likely see a reduction in direct HTML editing, which seems to be what most often results in introducing layout-disrupting invalid input.
The garbage in -> garbage out approach might seem appealing in principle, but any transition to such a condition is going to dredge up a lot of malformed HTML code created by wiki editors that we've been hiding for many years. If one is going to replace Tidy with something substantially different in execution, I would suggest that one needs a significant test suite of complex pages in order to judge how bad the collateral damage is likely to be, and ideally some set of tools to help editors fix it.
I think dredging up bad input in order to fix it is appropriate. A transition period could include the ability to temporarily render a page without Tidy enabled to see what issues present themselves. As I said previously, browsers are fairly resilient to moderately bad input, but even the really bad code should probably be properly addressed via the wiki process instead of being glossed over with magical fixes and replacements in the form of Tidy.
In addition to following the garbage principle, we would also be following the idea of failing fast and loudly, if the layout gets borked by a missing tag, for example.
(In continuing to think about this problem generally and how other sites/platforms have solved or mitigated it, it's amusing to me that we allow div, span, and inline styling and arbitrary attributes (both of which require separate sanitization), and yet we continue to disallow rendering of the anchor element.)
MZMcBride
On Saturday, August 15, 2015, MZMcBride z@mzmcbride.com wrote:
[...]
As we get saner input mechanisms (CodeEditor, VisualEditor, ScoreEditor, etc.), we'll likely see a reduction in direct HTML editing, which seems to be what most often results in introducing layout-disrupting invalid input.
I don't know about that. Viz editor is targeting ordinary tasks. It's the complex things that mess stuff up.
[...]
I think dredging up bad input in order to fix it is appropriate. A transition period could include the ability to temporarily render a page without Tidy enabled to see what issues present themselves. As I said previously, browsers are fairly resilient to moderately bad input, but even the really bad code should probably be properly addressed via the wiki process instead of being glossed over with magical fixes and replacements in the form of Tidy.
In addition to following the garbage principle, we would also be following the idea of failing fast and loudly, if the layout gets borked by a missing tag, for example.
Failing fast and loud is good in lots of contexts. I don't think wiki editing is one of them.
(In continuing to think about this problem generally and how other sites/platforms have solved or mitigated it, it's amusing to me that we allow div, span, and inline styling and arbitrary attributes (both of which require separate sanitization), and yet we continue to disallow rendering of the anchor element.)
Afaik, anchors are disallowed because spammers commonly insert them. It's trivial to sanitize and allow them if we so desired.
-- bawolff
Brian Wolff wrote:
I don't know about that. Viz editor is targeting ordinary tasks. It's the complex things that mess stuff up.
In most contexts, solving the ordinary/common cases is a pretty big win.
Failing fast and loud is good in lots of contexts. I don't think wiki editing is one of them.
The only cited example of real breakage so far has been mismatched <div>s. How often are you or anyone else adding <div>s to pages? In my experience, most users rely on MediaWiki templates for any kind of complex markup.
Echoing my initial reply in this thread, I still don't really understand what behaviors from Tidy we want to keep. I've been following https://phabricator.wikimedia.org/T89331 a bit and it also hasn't helped answer this question.
Afaik, anchors are disallowed because spammers commonly insert them. It's trivial to sanitize and allow them if we so desired.
Spammers can trivially insert anchors (links). Additional wrapper markup isn't even needed; we automatically render hyperlinks if a string has a prefix that looks like it might be a URL. In any case, this is the subject of https://phabricator.wikimedia.org/T35886.
MZMcBride
On 08/17/2015 10:15 PM, MZMcBride wrote:
Failing fast and loud is good in lots of contexts. I don't think wiki editing is one of them.
The only cited example of real breakage so far has been mismatched <div>s. How often are you or anyone else adding <div>s to pages? In my experience, most users rely on MediaWiki templates for any kind of complex markup.
Echoing my initial reply in this thread, I still don't really understand what behaviors from Tidy we want to keep. I've been following https://phabricator.wikimedia.org/T89331 a bit and it also hasn't helped answer this question.
Wikitext is string-based and generates an HTML string that, in the general case, need not be well-formed HTML. There is a lot of broken wikitext out there, and if you remove Tidy without introducing an HTML5-parser-based balancer, you are going to see a lot of breakage.
* Unclosed HTML tags (very common)
* Misnested tags
* Misnesting of tags (ex: links in links .. [http://foo.bar this is a [[foobar]] company])
* Fostered content in tables (<table>this-content-will-show-up-outside-the-table<tr><td>....</td></tr></table>) ... this has been one of the biggest sources of complexity inside Parsoid ... in combination with templates, this is nasty.
* Other ways in which the HTML5 content model might be violated (ex: <small>\n*a\n*b\n</small>)
* Look at the parser tests file and see all the tests we've added with annotations that say "php parser relies on tidy"
[[ Tangent: We have a linting option in Parsoid that we can turn on in production that can dump information about all these broken forms of wikitext (we have this information because we have to break the wikitext in the same ways when we convert html to wikitext). We haven't turned it on in production yet because we haven't yet had the time to hook this into project wikicheck .. we had initial conversations, but we couldn't follow up on our end. ]]
Besides these, there is also other unrelated-to-html5-semantics behavior that wikis have come to rely on.
* Stripping of empty tags -- correct page rendering relies on the fact that Tidy strips empty elements from HTML. We had to explicitly add this behavior to Parsoid so pages render identically. We could rip this out as long as all those templates are fixed up. The infobox on itwiki:Luna relies on this, to give you a specific example.
* Some behaviors found in https://phabricator.wikimedia.org/T4542
* I am sure there are a bunch of other behaviors that I am missing / don't know about.
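The empty-tag stripping mentioned above can be sketched over plain strings (a real implementation would walk the DOM; the element list here is illustrative only):

```javascript
// Hedged sketch: remove empty inline elements the way Tidy does, over a
// tiny regex model rather than a real DOM. Repeats until a fixed point,
// since stripping one empty element can make its parent empty too.
function stripEmptyInline(html) {
  let prev;
  do {
    prev = html;
    html = html.replace(/<(span|b|i|small|font)\b[^>]*>\s*<\/\1>/g, '');
  } while (html !== prev);
  return html;
}

console.log(stripEmptyInline('<b><i></i></b>text<span> </span>'));
// → text
```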
So, you cannot just rip out Tidy and not replace it with something in its place. Even replacing it with a HTML5 parser (as per the current plan) is not entirely straightforward simply because of all the other unrelated-to-html5-semantics behavior. Part of the task of replacing Tidy is to figure out all the ways those pages might break and the best way to handle that breakage.
Going forward, we are thinking about how to enforce stricter constraints on what templates (and extensions) can produce, so that impacts from broken wikitext are contained. That will give you some of what you are asking for ("fail fast", but in a different form). That requires a functioning HTML5 treebuilder / parser to be in place, which is what this RFC is about.
Subbu.
Subramanya Sastry wrote:
* Unclosed HTML tags (very common)
* Misnested tags
* Misnesting of tags (ex: links in links .. [http://foo.bar this is a [[foobar]] company])
* Fostered content in tables (<table>this-content-will-show-up-outside-the-table<tr><td>....</td></tr></table>) ... this has been one of the biggest sources of complexity inside Parsoid ... in combination with templates, this is nasty.
* Other ways in which the HTML5 content model might be violated (ex: <small>\n*a\n*b\n</small>)
* Look at the parser tests file and see all the tests we've added with annotations that say "php parser relies on tidy"
I don't see why we would want to incur the maintenance cost of continuing to support any of these bad inputs. I think we should look to deprecate, not replace, Tidy. This is a case of the cure being worse than the disease.
So, you cannot just rip out Tidy and not replace it with something in its place. Even replacing it with a HTML5 parser (as per the current plan) is not entirely straightforward simply because of all the other unrelated-to-html5-semantics behavior. Part of the task of replacing Tidy is to figure out all the ways those pages might break and the best way to handle that breakage.
We shouldn't rip out Tidy immediately, we should implement a means of disabling Tidy on a per-page or per-user basis and allow the wiki process to correct bad markup over time. Cunningham's Law applies here.
MZMcBride
On 08/18/2015 07:58 AM, MZMcBride wrote:
Subramanya Sastry wrote:
[...]
I don't see why we would want to incur the maintenance cost of continuing to support any of these bad inputs. I think we should look to deprecate, not replace, Tidy. This is a case of the cure being worse than the disease.
Are you suggesting that we get rid of wikitext editing? If not, you cannot assume editors are going to write perfect markup.
What is needed is a way to define DOM scopes in wikitext and enforce well-formedness within scopes. So, for example, template output can be considered a DOM scope (either opt-in or opt-out). If we felt bold, we could define a list to be a DOM scope .. or a table to be a DOM scope ... or an image caption to be a DOM scope, and so on.
Rather than expect editors to write perfect markup, we should be thinking about sane semantics for them, like scoping, that delimit the effects of broken markup. With proper semantics, it is easier to reason about markup and not rely on the whimsical behavior of whatever tool we used yesterday, use today, or might use tomorrow.
We are working towards these kinds of scoping semantics, and the first step on the way is to get an HTML5 treebuilder / parser in place.
Subbu.
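[Editorial note: the scoping idea above can be sketched roughly in a few lines. This is a hypothetical illustration, not MediaWiki or Parsoid code: it treats a fragment (say, template output) as a scope and auto-closes any tags still open at the scope boundary, so broken markup inside the scope cannot leak into the rest of the page. `ScopeBalancer` and `close_scope` are invented names, and the void-tag list is abbreviated.]

```python
from html.parser import HTMLParser

# Tags that never take a closing tag in HTML (abbreviated list).
VOID = {"br", "hr", "img", "input", "meta", "link", "area", "base",
        "col", "embed", "source", "track", "wbr"}

class ScopeBalancer(HTMLParser):
    """Track which tags are still open inside a fragment (a 'DOM scope')."""
    def __init__(self):
        super().__init__()
        self.stack = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, discarding misnested opens.
        if tag in self.stack:
            while self.stack and self.stack[-1] != tag:
                self.stack.pop()
            self.stack.pop()

def close_scope(fragment):
    """Append closing tags so the fragment cannot leak styling past its scope."""
    p = ScopeBalancer()
    p.feed(fragment)
    return fragment + "".join(f"</{t}>" for t in reversed(p.stack))
```

For example, `close_scope('<div class="box"><small>text')` would return the fragment with `</small></div>` appended, containing the damage to the scope.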
Hey
The only cited example of real breakage so far has been mismatched <div>s. How often are you or anyone else adding <div>s to pages? In my experience, most users rely on MediaWiki templates for any kind of complex markup.
I don't know how up to date this manual page is, but mediawiki.org explicitly states that templates copied from wikipedias may need to have tidy activated in order to work properly [1]. The old Template:Infobox (before Scribunto) sure did...
[1]: https://www.mediawiki.org/wiki/Manual:Using_content_from_Wikipedia#HTMLTidy
Regards, Tobias Oetterer
-- If this email is rather brief, it is not meant to be impolite but to respect your time. http://five.sentenc.es No trees were killed to send this message, but a large number of electrons were terribly inconvenienced
_______________________________________________ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
On 18 August 2015 at 04:15, MZMcBride z@mzmcbride.com wrote:
Brian Wolff wrote:
I don't know about that. Viz editor is targeting ordinary tasks. It's the complex things that mess stuff up.
In most contexts, solving the ordinary/common cases is a pretty big win.
Or when it turns a complex task into a simple one, e.g. table editing (one click to remove a column).
- d.
If we want to do away with Tidy, we will have to make all editors perfect HTML authors, or we risk them damaging pages so much that they potentially can't access the edit button anymore. As far as I'm concerned, this is what Tidy primarily does: isolate errors in the content in such a way that they cannot influence the rest of the interface of the website. And yes, I do regularly see such problems in MediaWiki instances that do not run Tidy.
Rule one of security. Always have multiple layers of defense. Yes we should reduce the amount of problems and make them more visible, but that doesn't mean we don't still need a correctional method as a fallback.
DJ
On Tue, Aug 18, 2015 at 11:48 PM, Derk-Jan Hartman < d.j.hartman+wmf_ml@gmail.com> wrote:
If we want to do away with Tidy, we will have to make all editors perfect html authors
In my experience, mismatched tags are quite often used on purpose. For example, Cyberpower678 has two unmatched div tags at the end of his StandardLayout template (https://en.wikipedia.org/wiki/User:Cyberpower678/StandardLayout), used to put a shaded border round the posts on his talk page (https://en.wikipedia.org/wiki/User_talk:Cyberpower678). There are no corresponding closing div tags at the end of the talk page, as they would be moved by the talk page archive bot, and Tidy takes care of the invalid HTML anyway.
On Tue, 18 Aug 2015 05:15:05 +0200, MZMcBride z@mzmcbride.com wrote:
The only cited example of real breakage so far has been mismatched
<div>s. How often are you or anyone else adding <div>s to pages? In my experience, most users rely on MediaWiki templates for any kind of complex markup.
Echoing my initial reply in this thread, I still don't really understand what behaviors from Tidy we want to keep. I've been following https://phabricator.wikimedia.org/T89331 a bit and it also hasn't helped answer this question.
Mismatched tags of any kind. An opening <foo> or closing </foo> tag without a pair can wreak havoc on the entire page, including the interface.
I recall reports of an unclosed <small> or <b> reducing the font size of, or bolding, the entire page. I can't find that one, but here's a small collection of bugs caused by Tidy unintentionally not running in various contexts: T27888, T29889, T40273, T44016, T60042, T60439.
You could easily engineer this to hide the tabs if you were malicious (making it impossible for casual users to edit the page, say, to fix the broken markup), and it might even be doable by accident.
We really do need this feature. Not anything else that Tidy does, most of its behavior is actually damaging, but we need to match the open and close tags to prevent the interface from getting jumbled.
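[Editorial note: just this one feature, detecting tags that would leak past the content area, can be sketched with Python's standard library. This is an illustration, not a proposal for the actual implementation; `TagAuditor` and `audit` are invented names and the void-tag list is abbreviated.]

```python
from html.parser import HTMLParser

# Tags that never take a closing tag in HTML (abbreviated list).
VOID = {"br", "hr", "img", "input", "meta", "link", "area", "base",
        "col", "embed", "source", "track", "wbr"}

class TagAuditor(HTMLParser):
    """Report tags that would leak past the end of the content area."""
    def __init__(self):
        super().__init__()
        self.open_tags = []     # opened but never closed
        self.stray_closes = []  # closing tags with no matching open

    def handle_starttag(self, tag, attrs):
        if tag not in VOID:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in self.open_tags:
            # Close everything up to and including the match.
            while self.open_tags.pop() != tag:
                pass
        else:
            self.stray_closes.append(tag)

def audit(fragment):
    a = TagAuditor()
    a.feed(fragment)
    return a.open_tags, a.stray_closes
```

Here `audit('<small><b>text</b>')` reports `small` as unclosed, exactly the kind of tag that would otherwise shrink the rest of the page, interface included.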
On 13/08/15 15:43, MZMcBride wrote:
Or could we replace Tidy with nothing? Relying on the principle of "garbage in, garbage out" seems reasonable in some ways. And modern browsers are fairly adept at handling moderately bad HTML.
The HTML 5 spec makes a distinction between valid, balanced HTML and error recovery algorithms. Browsers are basically the only clients able to handle moderately bad HTML, and as I've previously said in discussions of HTML 5 output, I don't think it is acceptable to screw over all non-browser clients by sending output that relies on obscure details of the HTML 5 spec. I think XHTML or something close to it is an appropriate machine-readable output format.
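[Editorial note: the point about non-browser clients can be made concrete with a strict XML parser standing in for a machine consumer: it accepts balanced, XHTML-like output and rejects markup that relies on the HTML 5 error-recovery algorithm. A hedged sketch; `machine_readable` is an invented name, and as written the fragment must have a single root element.]

```python
import xml.etree.ElementTree as ET

def machine_readable(fragment):
    """A strict, non-browser consumer: accepts only well-formed markup."""
    try:
        ET.fromstring(fragment)
        return True
    except ET.ParseError:
        return False
```

A balanced fragment like `<div><p>balanced</p></div>` passes, while `<div><p>unclosed</div>` (which a browser would quietly recover from) is rejected.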
Have you looked at my survey on the bug? Compliant HTML 5 parsers are 10-30k source lines and are in pretty short supply.
Wikitext is not meant to be easily machine-readable, it is meant to be easily human-writable. Unbalanced tags in HTML are errors, but in wikitext they are allowed. This is a design choice. Most humans don't really care about the spec, they just want the machine to figure out what they meant.
And, as several others have noted, you can't just disable Tidy, since the effects of unclosed tags are not confined to the content area, and there is a large amount of existing content that depends on it. I have seen the effects of Tidy being accidentally disabled on the English Wikipedia, it is not pleasant.
Am I correct in saying that MZMcBride is the only person in this thread in favour of the idea of getting rid of HTML cleanup?
By the way, you can see my work in progress on an HTML reserializer web service in the mediawiki/services/html5depurate project on Gerrit:
-- Tim Starling
Tim Starling wrote:
The HTML 5 spec makes a distinction between valid, balanced HTML and error recovery algorithms. Browsers are basically the only clients able to handle moderately bad HTML, and as I've previously said in discussions of HTML 5 output, I don't think it is acceptable to screw over all non-browser clients by sending output that relies on obscure details of the HTML 5 spec. I think XHTML or something close to it is an appropriate machine-readable output format.
Machine-readable output format? Are you suggesting that there would be a change from the current policy of telling everyone who screen-scrapes HTML not to ever do it and to instead use api.php? Otherwise, given that the majority of our actual traffic comes from actual browsers, as I understand it, I'm not sure I see which clients you're trying to serve.
And, as several others have noted, you can't just disable Tidy, since the effects of unclosed tags are not confined to the content area, and there is a large amount of existing content that depends on it. I have seen the effects of Tidy being accidentally disabled on the English Wikipedia, it is not pleasant.
Am I correct in saying that MZMcBride is the only person in this thread in favour of the idea of getting rid of HTML cleanup?
I think it depends what you mean by "HTML cleanup." Are you referring only to "fixing" mismatched HTML elements or are you also referring to reimplementing all of the other behavior that Tidy brings in?
Bartosz wrote:
We really do need this feature. Not anything else that Tidy does, most of its behavior is actually damaging, but we need to match the open and close tags to prevent the interface from getting jumbled.
My reading of this thread is that this is the consensus view. The problem, as I see it, is that Tidy has been deployed long enough that some users are also relying on all of its other bad behaviors. It seems to me that a replacement for Tidy either has to reimplement all of its unwanted behaviors to avoid breakage with current wikitext or it has to break an unknown amount of current wikitext.
MZMcBride
On Wed, Aug 19, 2015 at 1:22 PM, MZMcBride z@mzmcbride.com wrote:
Bartosz wrote:
We really do need this feature. Not anything else that Tidy does, most of its behavior is actually damaging, but we need to match the open and close tags to prevent the interface from getting jumbled.
My reading of this thread is that this is the consensus view. The problem, as I see it, is that Tidy has been deployed long enough that some users are also relying on all of its other bad behaviors. It seems to me that a replacement for Tidy either has to reimplement all of its unwanted behaviors to avoid breakage with current wikitext or it has to break an unknown amount of current wikitext.
My $0.02 from the peanut gallery: If we fixed up the bulk of the most common cases we can (where the bad HTML is not the result of an edit error), could we keep a Tidy/HTML5 type of thing around, but move it to edit validation rather than render output processing? We could start by leaving the current output-side code alone, and warning (to the user as a minor info blurb on edit submission, and in our logs) about edits that fail validation, so that we can get some idea of the scope and causes of the problem, fix what we can, and then evaluate whether we can eventually start flat-out rejecting the minority of edits that fail validation and then eventually remove the tidy on the output side. That ignores the whole problem of existing bad html already in the DB, of course, but that could probably be fixed with a one-time job...
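[Editorial note: the warn-don't-reject validation step described above might look roughly like the sketch below, which checks a rendered fragment for balance and logs a warning instead of rejecting the edit, so the scope of the problem can be measured first. `on_edit_save` and `BalanceCheck` are invented names; this is not MediaWiki code.]

```python
import logging
from html.parser import HTMLParser

log = logging.getLogger("edit-validation")

class BalanceCheck(HTMLParser):
    """Count unclosed and misnested tags in a rendered fragment."""
    def __init__(self):
        super().__init__()
        self.stack = []
        self.misnested = 0

    def handle_starttag(self, tag, attrs):
        if tag not in ("br", "hr", "img"):  # abbreviated void-tag list
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()
        else:
            self.misnested += 1

def on_edit_save(page_title, rendered_html):
    """Warn (don't reject) so we can measure how many edits would fail."""
    c = BalanceCheck()
    c.feed(rendered_html)
    ok = not c.stack and c.misnested == 0
    if not ok:
        log.warning("unbalanced markup on %s: %d unclosed, %d misnested",
                    page_title, len(c.stack), c.misnested)
    return ok
```

Flipping the warning into a rejection later would then be a one-line policy change, once the logs show the failure rate is low enough.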
Il 19/08/2015 15:46, Brandon Black ha scritto:
On Wed, Aug 19, 2015 at 1:22 PM, MZMcBride z@mzmcbride.com wrote:
Bartosz wrote:
We really do need this feature. Not anything else that Tidy does, most of its behavior is actually damaging, but we need to match the open and close tags to prevent the interface from getting jumbled.
My reading of this thread is that this is the consensus view. The problem, as I see it, is that Tidy has been deployed long enough that some users are also relying on all of its other bad behaviors. It seems to me that a replacement for Tidy either has to reimplement all of its unwanted behaviors to avoid breakage with current wikitext or it has to break an unknown amount of current wikitext.
My $0.02 from the peanut gallery: If we fixed up the bulk of the most common cases we can (where the bad HTML is not the result of an edit error), could we keep a Tidy/HTML5 type of thing around, but move it to edit validation rather than render output processing? We could start by leaving the current output-side code alone, and warning (to the user as a minor info blurb on edit submission, and in our logs) about edits that fail validation, so that we can get some idea of the scope and causes of the problem, fix what we can, and then evaluate whether we can eventually start flat-out rejecting the minority of edits that fail validation and then eventually remove the tidy on the output side. That ignores the whole problem of existing bad html already in the DB, of course, but that could probably be fixed with a one-time job...
Keep in mind that a lot of templates intentionally consist of 'broken' HTML that is then 'put back together' in articles...
On 08/19/2015 08:22 AM, MZMcBride wrote:
And, as several others have noted, you can't just disable Tidy, since the effects of unclosed tags are not confined to the content area, and there is a large amount of existing content that depends on it. I have seen the effects of Tidy being accidentally disabled on the English Wikipedia, it is not pleasant.
Am I correct in saying that MZMcBride is the only person in this thread in favour of the idea of getting rid of HTML cleanup?
I think it depends what you mean by "HTML cleanup." Are you referring only to "fixing" mismatched HTML elements or are you also referring to reimplementing all of the other behavior that Tidy brings in?
Bartosz wrote:
We really do need this feature. Not anything else that Tidy does, most of its behavior is actually damaging, but we need to match the open and close tags to prevent the interface from getting jumbled.
My reading of this thread is that this is the consensus view. The problem, as I see it, is that Tidy has been deployed long enough that some users are also relying on all of its other bad behaviors. It seems to me that a replacement for Tidy either has to reimplement all of its unwanted behaviors to avoid breakage with current wikitext or it has to break an unknown amount of current wikitext.
In response to both these queries, see this snippet from my earlier post on this thread ( https://lists.wikimedia.org/pipermail/wikitech-l/2015-August/082806.html )
"Even replacing it with a HTML5 parser (as per the current plan) is not entirely straightforward simply because of all the other unrelated-to-html5-semantics behavior. Part of the task of replacing Tidy is to figure out all the ways those pages might break and the best way to handle that breakage."
Also see https://phabricator.wikimedia.org/T89331#1499979 about how we might go about evaluating this.
So, we aren't saying we'll implement those Tidy behaviors here. Part of the solution might very well be to break some of that Tidy behavior and have the pages be fixed up (by bots, manually, or however). In any case, the first step is to understand those impacts.
Subbu.
I mentioned this once before:
http://www.htacg.org/tidy-html5/
While Tidy died in 2008, this fork lives on and is HTML5 aware. That will at least solve a lot of problems *caused* by Tidy, such as not allowing block elements inside inline elements (which is allowed in HTML5).
Can we at least evaluate if this is a suitable interim solution?
Regards,
On 20/08/15 01:21, Erwin Dokter wrote:
I mentioned this once before:
http://www.htacg.org/tidy-html5/
While Tidy died in 2008, this fork lives on and is HTML5 aware. That will at least solve a lot of problems *caused* by Tidy, such as not allowing block elements inside inline elements (which is allowed in HTML5).
Can we at least evaluate if this is a suitable interim solution?
That's not a solution to the problems that we are trying to solve.
As I said in my original post, my number one problem with Tidy is that it changes. So I am very happy that it is not in active development. Switching to a fork that is actively maintained would be much worse. It would be like the switch from Tidy to the proposed HTML reserializer web service, except that the pain would be repeated every time we upgrade our Linux distribution.
The other problem with Tidy is that it is poorly specified and has only one implementation. Switching to a fork of it doesn't improve the situation.
HTML 5 has not significantly relaxed the rules about block elements inside inline elements. The terminology has changed: now instead of inline elements we have "phrasing content" and instead of block elements we have "flow content". You're still not allowed to put a <div> inside a <span>, because <span> is phrasing content and <div> isn't.
The "children" column here has a summary:
http://www.w3.org/TR/html5/index.html#elements-1
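[Editorial note: the phrasing/flow distinction can be modeled as a lookup, as in this sketch with a tiny, hand-picked subset of the spec's categories. `PHRASING_ONLY`, `FLOW_ONLY`, and `allowed_child` are invented names; the real content models in the index above are far richer.]

```python
# Elements whose content model is phrasing content only (subset).
PHRASING_ONLY = {"span", "b", "i", "small", "em", "strong", "code"}
# Elements that are flow content but not phrasing content (subset).
FLOW_ONLY = {"div", "p", "table", "ul", "ol", "blockquote"}

def allowed_child(parent, child):
    """Phrasing-content elements may not contain flow-only elements."""
    return not (parent in PHRASING_ONLY and child in FLOW_ONLY)
```

So `allowed_child("span", "div")` is False in HTML 5 just as it was before, while `allowed_child("div", "span")` is fine.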
-- Tim Starling
On Thu, 20 Aug 2015 02:32:12 +0200, Tim Starling tstarling@wikimedia.org wrote:
On 20/08/15 01:21, Erwin Dokter wrote:
I mentioned this once before:
http://www.htacg.org/tidy-html5/
While Tidy died in 2008, this fork lives on and is HTML5 aware. That will at least solve a lot of problems *caused* by Tidy, such as not allowing block elements inside inline elements (which is allowed in HTML5).
HTML 5 has not significantly relaxed the rules about block elements inside inline elements. The terminology has changed: now instead of inline elements we have "phrasing content" and instead of block elements we have "flow content". You're still not allowed to put a <div> inside a <span>, because <span> is phrasing content and <div> isn't.
Erwin might be referring to T73962 (<a><div>Foo</div></a> is changed to <a></a><div>Foo</div> by Tidy), which is related to a change in semantics in HTML 5 (previously <a> was an inline element, now it is "transparent").
[T73962] https://phabricator.wikimedia.org/T73962