Re: [Wikitech-l] RFC: Replace Tidy with HTML 5 parse/reserialize

13 Aug 2015


On 8/12/15, MZMcBride z@mzmcbride.com wrote:
...
Tim Starling wrote:
...
https://phabricator.wikimedia.org/T89331
Running the output of the MediaWiki parser through HTML Tidy always
seemed like a nasty hack. The effects on wikitext syntax are arbitrary
and change from version to version. When we upgrade our Linux
distribution, we sometimes see changes in the HTML generated by given
wikitext, which is not ideal.
[...]
We can get nearly the same effect in MediaWiki by replacing the Tidy
transformation stage with an HTML 5 parse followed by serialization of
the DOM back to HTML. This would stabilize wikitext syntax and resolve
several important syntax differences compared to Parsoid.
Related tasks:

https://phabricator.wikimedia.org/T4542
https://phabricator.wikimedia.org/T56617

It's not clear to me which behaviors from Tidy we want to keep. Looking at
the various bugs that Tidy has caused, it's apparent that there are a number
of behaviors we want to disable/avoid.
My understanding is that Tidy is not responsible for output sanitization
and it's not responsible for preprocessing or parsing. MediaWiki handles
all of that elsewhere. If Tidy is only needed for mismatched HTML
elements, we could possibly catch and disallow or gracefully handle that
specific use-case in MediaWiki. What other beneficial behavior of Tidy
would we need to replicate?
Or could we replace Tidy with nothing? Relying on the principle of
"garbage in, garbage out" seems reasonable in some ways. And modern
browsers are fairly adept at handling moderately bad HTML.
MZMcBride
The main thing Tidy does (imo) is ensure that the damage from mismatched
HTML is localized. When somebody makes a mistake, it can cause the entire
skin to go whacko. We ideally want markup mistakes to affect only
the user-generated content (and preferably, only the area around
the mistake).
--bawolff
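
To make the localization point concrete, here is a minimal sketch in Python
of what "parse, then reserialize" buys you. It is not MediaWiki code (that
is PHP), and it is not the HTML 5 tree-building algorithm either: it only
uses the standard library's html.parser to close elements left open and to
drop stray end tags. The names Balancer and balance_fragment are made up for
this illustration, and comments and script/style content are not handled.

```python
from html import escape
from html.parser import HTMLParser

# Elements that never take an end tag.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class Balancer(HTMLParser):
    """Hypothetical helper: re-emits a fragment with balanced tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []    # serialized output pieces
        self.stack = []  # currently open (non-void) elements

    def handle_starttag(self, tag, attrs):
        attr_text = "".join(
            f' {k}="{escape(v, quote=True)}"' if v is not None else f" {k}"
            for k, v in attrs
        )
        self.out.append(f"<{tag}{attr_text}>")
        if tag not in VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.stack:
            return  # stray end tag: drop it so it cannot close the skin's markup
        # Close everything opened after the matching start tag, then the tag itself.
        while self.stack:
            open_tag = self.stack.pop()
            self.out.append(f"</{open_tag}>")
            if open_tag == tag:
                break

    def handle_data(self, data):
        # Character references were decoded by the parser; re-escape the text.
        self.out.append(escape(data, quote=False))


def balance_fragment(html):
    parser = Balancer()
    parser.feed(html)
    parser.close()  # flushes any buffered trailing text
    while parser.stack:  # close whatever the author left open
        parser.out.append(f"</{parser.stack.pop()}>")
    return "".join(parser.out)


print(balance_fragment("<div><b>bold text</div> after"))
```

With this, an unclosed <b> or a stray </div> in user content stays inside
the content area instead of leaking into the surrounding skin. A real HTML 5
parser does considerably more (implied elements, foster parenting in tables,
the adoption agency algorithm for misnested formatting), so treat this only
as an illustration of the containment idea.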
