On 06/29/2015 09:19 AM, Brad Jorsch (Anomie) wrote:
On Thu, Jun 25, 2015 at 6:22 PM, Subramanya Sastry <ssastry@wikimedia.org mailto:ssastry@wikimedia.org> wrote:
* Pare down rendering differences between the two systems so that we can start thinking about using Parsoid HTML instead of MWParser HTML for read views. ( https://phabricator.wikimedia.org/T55784 )
Any hope of adding the Parsoid metadata to the MWParser HTML so various fancy things can be done in core MediaWiki for smaller installations instead of having to run a separate service? Or does that fall under "Make Parsoid redundant in its current complex avatar"?
Short answer: the latter. Long answer: read on.
Our immediate focus in the coming months would be to bring PHP parser and Parsoid output closer. Some of that work would be to tweak Parsoid output / CSS where required, but also to bring PHP parser output closer to Parsoid output. https://gerrit.wikimedia.org/r/#/c/196532/ is one step along those lines, for example. Scott has said he will review that closely with this goal in mind. Another step is to get rid of Tidy and use an HTML5-compliant tree builder similar to what Parsoid uses.
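To make the Tidy-replacement point concrete, here is a small illustrative sketch (Python with html5lib, which implements the HTML5 tree-construction algorithm; this is just for illustration, not necessarily the library that would replace Tidy) of the kind of restructuring an HTML5 tree builder performs on malformed table content -- an algorithm Tidy, which predates HTML5, does not follow:

    import html5lib
    from xml.etree import ElementTree as ET

    # Non-whitespace content sitting directly inside a <table> (outside any cell)
    # is "foster-parented" out in front of the table by the HTML5 tree-building
    # algorithm, so the DOM no longer mirrors the source order.
    fragment = "<table><tr><td>cell</td></tr>stray text</table>"
    doc = html5lib.parse(fragment, namespaceHTMLElements=False)

    body = doc.find("body")
    print(ET.tostring(body, encoding="unicode"))
    # e.g. <body>stray text<table><tbody><tr><td>cell</td></tr></tbody></table></body>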
Beyond these initial steps, bringing the two together (both in terms of output and functionality) will require bridging the computational models ... string-based vs. DOM-based. For example, we cannot really add Parsoid-style metadata for templates to the PHP parser output without being able to analyze the DOM -- and that requires access to the DOM after Tidy (or, ideally, its replacement) has had a go at it. It also requires implementing all the dirty tricks Parsoid uses to identify template boundaries in the presence of unclosed tags, misnested tags, content fostered out of tables, and the DOM restructuring the HTML tree builder does to comply with HTML5 semantics.
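To give a feel for what that DOM analysis looks like, here is a rough sketch (Python with html5lib again; the markup below is hand-written to mimic Parsoid's output format, not real Parsoid output): every node emitted by one transclusion shares an "about" id, so even when tree building scatters a template's output across siblings, its extent can still be recovered by grouping on that id.

    import html5lib
    from collections import defaultdict

    # Hand-written markup mimicking Parsoid's template encapsulation: nodes from
    # the same transclusion carry the same "about" id, and the first one is typed
    # as a transclusion.  (Illustrative only.)
    fragment = """
    <p about="#mwt1" typeof="mw:Transclusion">intro emitted by a template</p>
    <table about="#mwt1"><tbody><tr><td>row emitted by the same template</td></tr></tbody></table>
    <p>ordinary paragraph, not from a template</p>
    """

    body = html5lib.parse(fragment, namespaceHTMLElements=False).find("body")

    # Group top-level nodes by their "about" id to recover each template's extent.
    templates = defaultdict(list)
    for node in body:
        about = node.get("about")
        if about:
            templates[about].append(node.tag)

    print(dict(templates))  # {'#mwt1': ['p', 'table']}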
Besides that, if you also want to serialize this back to wikitext without introducing dirty diffs (there is really no reason to do all this extra work if you cannot also serialize it back to wikitext), you need to either (a) maintain a lot of extra state in the DOM beyond what Parsoid maintains, or (b) do all the additional work that Parsoid does to maintain an extremely precise mapping between wikitext strings and DOM trees. Once again, (b) is complicated only because of unclosed tags, misnested tags, fostered content, and the DOM restructuring required by HTML5 semantics.
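Here is a toy sketch of what (b) boils down to (Python, with entirely made-up offsets and node structure): if every node remembers the source range of the wikitext it came from -- Parsoid stores this as "dsr" in data-parsoid -- then unmodified nodes can be serialized by copying their original wikitext verbatim, and only edited nodes have to go through full HTML-to-wikitext serialization.

    # Entirely invented offsets and nodes, just to show the shape of the idea.
    original_wikitext = "'''Bold''' and a [[Link]] here."

    nodes = [
        {"dsr": (0, 10), "modified": False},                             # '''Bold'''
        {"dsr": (10, 17), "modified": False},                            # " and a "
        {"dsr": (17, 25), "modified": True, "new_wt": "[[Other link]]"}, # edited node
        {"dsr": (25, 31), "modified": False},                            # " here."
    ]

    def serialize(nodes, source):
        out = []
        for n in nodes:
            if n["modified"]:
                # Only an edited node needs full HTML -> wikitext serialization.
                out.append(n["new_wt"])
            else:
                # Untouched nodes reuse their original wikitext slice verbatim,
                # so unedited parts of the page cannot produce dirty diffs.
                start, end = n["dsr"]
                out.append(source[start:end])
        return "".join(out)

    print(serialize(nodes, original_wikitext))  # '''Bold''' and a [[Other link]] here.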
There is a fair amount of complexity hidden in those two steps, and it really does not make sense to reimplement all of that in the PHP parser. If you did, you would at that point have effectively reimplemented Parsoid in PHP -- and the PHP parser in its current form is unlikely to stay as is.
So, the only real way out here is to move the wikitext computational model closer to a DOM model. This is not a done deal really, but we have talked about several ideas over the last couple years to move this forward in increments. I don't want to go into a lot of detail in this email since this is already getting lengthy, but I am happy to talk more about it if there is interest.
To summarize, here are the steps as we see them:
* Bring PHP parser and Parsoid output as close as we can (replace Tidy, fix PHP parser output wherever possible to be closer to Parsoid output).
* Incrementally move the wikitext computational model to be DOM-based, using Parsoid as the bridge that preserves compatibility. This is easier if we have removed Tidy from the equation.
* Smooth out the harder edge cases, which simplifies the problem and eliminates the complexity.
* At this point, Parsoid's current complexity will be unnecessary (specifics dependent on the previous steps) => you could have this functionality back in PHP if that is desired. But, by then, hopefully, there will also be better clarity about MediaWiki packaging that will also influence this. Or, some small wikis might decide to be HTML-only wikis.
Subbu.