On 06/29/2015 09:19 AM, Brad Jorsch (Anomie) wrote:
On Thu, Jun 25, 2015 at 6:22 PM, Subramanya Sastry <ssastry@wikimedia.org mailto:ssastry@wikimedia.org> wrote:
* Pare down rendering differences between the two systems so that we can start thinking about using Parsoid HTML instead of MWParser HTML for read views. ( https://phabricator.wikimedia.org/T55784 )
Any hope of adding the Parsoid metadata to the MWParser HTML so various fancy things can be done in core MediaWiki for smaller installations instead of having to run a separate service? Or does that fall under "Make Parsoid redundant in its current complex avatar"?
Short answer: the latter. Long answer: read on.
Our immediate focus in the coming months would be to bring PHP parser and Parsoid output closer. Some of that work would be to tweak Parsoid output / CSS where required, but also to bring PHP parser output closer to Parsoid output. https://gerrit.wikimedia.org/r/#/c/196532/ is one step along those lines, for example. Scott has said he will review that closely with this goal in mind. Another step is to get rid of Tidy and use an HTML5-compliant tree builder similar to what Parsoid uses.
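To make the Tidy-replacement point concrete, here is a small illustrative sketch (Python with html5lib, which implements the HTML5 tree-construction algorithm; this is just for illustration, not necessarily the library that would replace Tidy) of the kind of restructuring an HTML5 tree builder performs on malformed table content -- an algorithm Tidy, which predates HTML5, does not follow:

    import html5lib
    from xml.etree import ElementTree as ET

    # Non-whitespace content sitting directly inside a <table> (outside any cell)
    # is "foster-parented" out in front of the table by the HTML5 tree-building
    # algorithm, so the DOM no longer mirrors the source order.
    fragment = "<table><tr><td>cell</td></tr>stray text</table>"
    doc = html5lib.parse(fragment, namespaceHTMLElements=False)

    body = doc.find("body")
    print(ET.tostring(body, encoding="unicode"))
    # e.g. <body>stray text<table><tbody><tr><td>cell</td></tr></tbody></table></body>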
Beyond these initial steps, bringing the two together (both in terms of output and functionality) will require bridging the computational models ... string-based vs. DOM-based. For example, we cannot really add Parsoid-style metadata for templates to the PHP parser output without being able to analyze the DOM -- and that requires access to the DOM after Tidy (or, ideally, its replacement) has had a go at it. It also requires implementing all the dirty tricks Parsoid uses to identify template boundaries in the presence of unclosed tags, misnested tags, content fostered out of tables, and the DOM restructuring the HTML tree builder does to comply with HTML5 semantics.
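To give a feel for what that DOM analysis looks like, here is a rough sketch (Python with html5lib again; the markup below is hand-written to mimic Parsoid's output format, not real Parsoid output): every node emitted by one transclusion shares an "about" id, so even when tree building scatters a template's output across siblings, its extent can still be recovered by grouping on that id.

    import html5lib
    from collections import defaultdict

    # Hand-written markup mimicking Parsoid's template encapsulation: nodes from
    # the same transclusion carry the same "about" id, and the first one is typed
    # as a transclusion.  (Illustrative only.)
    fragment = """
    <p about="#mwt1" typeof="mw:Transclusion">intro emitted by a template</p>
    <table about="#mwt1"><tbody><tr><td>row emitted by the same template</td></tr></tbody></table>
    <p>ordinary paragraph, not from a template</p>
    """

    body = html5lib.parse(fragment, namespaceHTMLElements=False).find("body")

    # Group top-level nodes by their "about" id to recover each template's extent.
    templates = defaultdict(list)
    for node in body:
        about = node.get("about")
        if about:
            templates[about].append(node.tag)

    print(dict(templates))  # {'#mwt1': ['p', 'table']}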
Besides that, if you also want to serialize this back to wikitext without introducing dirty diffs (there is really no reason to do all this extra work if you cannot also serialize it back to wikitext), you need to either (a) maintain a lot of extra state in the DOM beyond what Parsoid maintains, or (b) do all the additional work that Parsoid does to maintain an extremely precise mapping between wikitext strings and DOM trees. Once again, (b) is complicated only because of unclosed tags, misnested tags, fostered content, and the DOM restructuring required by HTML5 semantics.
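Here is a toy sketch of what (b) boils down to (Python, with entirely made-up offsets and node structure): if every node remembers the source range of the wikitext it came from -- Parsoid stores this as "dsr" in data-parsoid -- then unmodified nodes can be serialized by copying their original wikitext verbatim, and only edited nodes have to go through full HTML-to-wikitext serialization.

    # Entirely invented offsets and nodes, just to show the shape of the idea.
    original_wikitext = "'''Bold''' and a [[Link]] here."

    nodes = [
        {"dsr": (0, 10), "modified": False},                             # '''Bold'''
        {"dsr": (10, 17), "modified": False},                            # " and a "
        {"dsr": (17, 25), "modified": True, "new_wt": "[[Other link]]"}, # edited node
        {"dsr": (25, 31), "modified": False},                            # " here."
    ]

    def serialize(nodes, source):
        out = []
        for n in nodes:
            if n["modified"]:
                # Only an edited node needs full HTML -> wikitext serialization.
                out.append(n["new_wt"])
            else:
                # Untouched nodes reuse their original wikitext slice verbatim,
                # so unedited parts of the page cannot produce dirty diffs.
                start, end = n["dsr"]
                out.append(source[start:end])
        return "".join(out)

    print(serialize(nodes, original_wikitext))  # '''Bold''' and a [[Other link]] here.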
There is a fair amount of complexity hidden in those two steps, and it really does not make sense to reimplement all of that in the PHP parser. If you did, you would at that point have effectively reimplemented Parsoid in PHP -- and the PHP parser in its current form is unlikely to stay as is.
So, the only real way out here is to move the wikitext computational model closer to a DOM model. This is not a done deal really, but we have talked about several ideas over the last couple years to move this forward in increments. I don't want to go into a lot of detail in this email since this is already getting lengthy, but I am happy to talk more about it if there is interest.
To summarize, here are the steps as we see them:
* Bring PHP parser and Parsoid output as close as we can (replace Tidy, fix PHP parser output wherever possible to be closer to Parsoid output).
* Incrementally move the wikitext computational model to be DOM-based, using Parsoid as the bridge that preserves compatibility. This is easier if we have removed Tidy from the equation.
* Smooth out the harder edge cases, which simplifies the problem and eliminates the complexity.
* At this point, Parsoid's current complexity will be unnecessary (specifics dependent on the previous steps) => you could have this functionality back in PHP if that is desired. But, by then, hopefully, there will also be better clarity about MediaWiki packaging that will also influence this. Or, some small wikis might decide to be HTML-only wikis.
Subbu.