Re: [Wikitech-l] RFC: Replace Tidy with HTML 5 parse/reserialize

18 Aug 2015

      On 08/18/2015 07:58 AM, MZMcBride wrote:
...
Subramanya Sastry wrote:
...

* Unclosed HTML tags (very common)
* Misnested tags
* Misnesting of tags (ex: links in links .. [http://foo.bar this is a
[[foobar]] company])
* Fostered content in tables
(<table>this-content-will-show-up-outside-the-table<tr><td>....
</td></tr></table>)
... this has been one of the biggest sources of complexity inside Parsoid
... in combination with templates, this is nasty.
* Other ways in which HTML5 content model might be violated. (ex:
<small>\n*a\n*b\n</small>)
* Look at the parser tests file and see all the tests we've added with
annotations that say "php parser relies on tidy"

I don't see why we would want to incur the maintenance cost of continuing
to support any of these bad inputs. I think we should look to deprecate,
not replace, Tidy. This is a case of the cure being worse than the disease.

Are you suggesting that you get rid of wikitext editing? If not, you
cannot assume editors are going to write perfect markup.

What is needed is a way to define DOM scopes in wikitext and enforce
well-formedness within scopes. So, for example, template output can be
considered a DOM scope (either opt-in or opt-out). If we felt bold, we
could define a list to be a DOM scope .. or a table to be a DOM scope ...
or an image caption to be a DOM scope, and so on.

Rather than expect editors to write perfect markup, we should be
thinking about sane semantics for them, like scoping that delimits the
effects of broken markup. With proper semantics, it is easier to reason
about markup and not rely on the whimsical behavior of whatever tool we
used yesterday, use today, or might use tomorrow.
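
To make the scoping idea concrete, here is a minimal sketch of "enforce well-formedness within a scope", assuming a scope is simply an HTML fragment (say, the output of one template). The names ScopeBalancer and balance_scope are hypothetical and are not part of Parsoid or MediaWiki. It uses Python's standard html.parser, which is only a tokenizer, so this is not the HTML5 tree-building algorithm: it does no foster parenting and does not check the content model. It only closes tags left open inside the scope and drops close tags that have no matching open tag in the scope.

```python
from html.parser import HTMLParser

# HTML void elements never take a closing tag.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class ScopeBalancer(HTMLParser):
    """Rebalance tags inside one scope (hypothetical helper, not an HTML5 treebuilder)."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.out = []    # serialized output pieces
        self.open = []   # stack of tag names currently open in this scope

    def handle_starttag(self, tag, attrs):
        self.out.append(self.get_starttag_text())
        if tag not in VOID:
            self.open.append(tag)

    def handle_startendtag(self, tag, attrs):
        # Self-closing syntax such as <br/>: emit as written, nothing to track.
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag not in self.open:
            return  # stray close tag: drop it so it cannot close an outer element
        # Close any misnested inner tags first, then the requested one.
        while True:
            top = self.open.pop()
            self.out.append(f"</{top}>")
            if top == tag:
                break

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")


def balance_scope(fragment: str) -> str:
    """Return the fragment with every tag opened in it also closed in it."""
    parser = ScopeBalancer()
    parser.feed(fragment)
    parser.close()
    # Close whatever the scope left open so the damage cannot leak out.
    while parser.open:
        parser.out.append(f"</{parser.open.pop()}>")
    return "".join(parser.out)
```

With this, a page assembled as "<div>" + balance_scope(template_output) + "</div>" keeps its outer structure even if the template emits an unclosed <small> or a stray </div>. The broken markup still renders in some repaired form, but its effects stop at the scope boundary.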

We are working towards this kind of scoping semantics, and the first
step on the way is to get an HTML5 treebuilder / parser in place.

Subbu.
