Re: [Wikitech-l] Timelines for Parsoid/PHP to replace legacy PHP parser

12 Jan 2020


      On 1/12/20 5:33 PM, Lord_Farin wrote:
...
Hi Wikitech,
I've been catching up on the recent achievements regarding Parsoid/PHP,
well done!
Thanks!
The switchover of wikitext engines is going to take some time. I would 
be surprised if we got all the ducks lined up before 18 months from now 
-- we have a bunch of work to do still.
Other details below:
...
With WMF sites being migrated, of course non-WMF sites start to creep into
the picture. As I'm involved in running one of those, I'm curious to know
if and how you are going to support this upgrade? I've read about Linter
and ParserMigration but I'm not clear on how they fit into the picture.
We built the Linter and ParserMigration extensions to support the 
replacement of HTML4 Tidy with RemexHTML [1]. We anticipate leveraging 
those in our efforts to consolidate behind Parsoid (post-unification 
work) as the default wikitext engine for MediaWiki. We don't quite know 
the specifics yet. My hunch is that this replacement is going to be most 
complex for Wikimedia wikis, and I expect most 3rd party wikis to have a 
much easier time switching over.
...
I'm asking specifically because we are running some custom extensions which
will probably break with the advent of Parsoid/PHP.
At present we are running MW 1.33 on PHP 7.0, but we are not using VE.
It would be fine if we as a maintenance team have to invest some (or even
considerable) time and effort but I would like to know the size of the
endeavour beforehand...
One of the changes that will take some work is how extensions interact 
with the parser (Parsoid in the future). So far, this happens through 
access to the Parser object as well as through parser hooks. However, in 
the Parsoid regime, this model will change. While the details are yet to 
be finalized and we are yet to publish the first draft for review 
(likely in the next couple months), here is how we've been thinking 
about this:
1. Extensions will no longer have direct access to the parser itself -- 
all interaction will be through an API / interface.
2. Hooks are unlikely to be based on timelines of how wikitext passes 
through the parser, i.e. before something happens, or after something 
happens. We are going to move more towards a pure functional model as 
far as possible. So, as far as extension tags are concerned, they get 
access to the tag source, args, and possibly some other information and 
are expected to return output HTML / a DOM fragment (here they will 
leverage the parser API/interface I mention in 1. above). Most 
extensions that implement custom tags already behave in this manner and 
this simply formalizes that.
3. Some extensions set parser state and update it across invocations. We 
currently have no intention of supporting that. We are going to look at 
what the underlying need is that is being modeled through side-effects / 
state and will provide first-class support for that in some manner. 
For example, some (like Cite) use state for enumeration and numbering 
purposes, and this can be done as a post-processing pass on the DOM when 
they get to inspect the "final" DOM. Presumably these global document 
processors are the exception, not the norm. But, statelessness lets us 
process the document in arbitrary order (or even skip processing parts 
of the document by reusing extension/template/media output from previous 
versions of the document), and use the final post-processing step as the 
synchronization step to enforce source-text ordering (like numbering).
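
To make points 2 and 3 concrete, here is a minimal sketch in Python (not PHP, and not Parsoid's actual code). All names in it (ref_handler, number_refs, render) and the placeholder markup are hypothetical and only illustrate the idea: the tag handler is a pure function of its source and args, fragments may be produced in any order, and a final pass over the composed document assigns numbers in source-text order.

```python
import html
import re
from typing import Dict, List, Tuple

# Hypothetical tag invocation: (tag source, tag args).
TagCall = Tuple[str, Dict[str, str]]


def ref_handler(source: str, args: Dict[str, str]) -> str:
    """Pure handler: keeps no counter, emits a placeholder for the number."""
    note = html.escape(source, quote=True)
    return '<sup class="ref" data-note="%s">[?]</sup>' % note


def number_refs(document_html: str) -> str:
    """Post-processing pass: fill in numbers in document (source-text) order."""
    counter = 0

    def fill(match: "re.Match[str]") -> str:
        nonlocal counter
        counter += 1
        return "%s[%d]%s" % (match.group(1), counter, match.group(2))

    pattern = r'(<sup class="ref"[^>]*>)\[\?\](</sup>)'
    return re.sub(pattern, fill, document_html)


def render(calls: List[TagCall], processing_order: List[int]) -> str:
    """Run handlers in an arbitrary order, compose in source order, then number."""
    fragments: Dict[int, str] = {}
    for index in processing_order:
        source, args = calls[index]
        fragments[index] = ref_handler(source, args)
    composed = " ".join(fragments[i] for i in range(len(calls)))
    return number_refs(composed)


calls = [("first note", {}), ("second note", {})]
in_order = render(calls, [0, 1])
reversed_order = render(calls, [1, 0])
```

Because the handler holds no state, both processing orders produce the same final document; only the last pass depends on source-text order.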
We anticipate most extensions are going to need some (hopefully minor) 
changes. If your extension doesn't deal with wikitext itself, the 
changes are probably going to be relatively minor. But, if your 
extension deals with wikitext, then it might need an update in terms of 
how it generates its output (using the ParsoidExtensionAPI interface 
instead of an actual parser object), but once again, this is unlikely to 
be a very significant change. However, if your extension maintains state 
across invocations, then it might need some rethinking (as stated in 3. 
above).
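
As a rough illustration of what "using an interface instead of an actual parser object" could look like, here is a small Python sketch. ExtensionAPI, wikitext_to_fragment, StubAPI and box_handler are all made-up names for this example; they are not the methods of the real ParsoidExtensionAPI, whose shape is not final. The point is only that the extension receives a narrow interface and returns a fragment.

```python
import html
import re
from typing import Dict, Protocol


class ExtensionAPI(Protocol):
    """Hypothetical narrow interface handed to extensions (an assumption)."""

    def wikitext_to_fragment(self, wikitext: str) -> str:
        ...


class StubAPI:
    """Stand-in for this sketch only: escapes text and handles '''bold'''."""

    def wikitext_to_fragment(self, wikitext: str) -> str:
        escaped = html.escape(wikitext, quote=False)
        return re.sub(r"'''(.+?)'''", r"<b>\1</b>", escaped)


def box_handler(api: ExtensionAPI, source: str, args: Dict[str, str]) -> str:
    """Extension tag handler: sees only the interface, never the parser."""
    title = html.escape(args.get("title", ""))
    body = api.wikitext_to_fragment(source)
    return '<div class="box"><h3>%s</h3>%s</div>' % (title, body)


fragment = box_handler(StubAPI(), "'''Hi''' & bye", {"title": "Note"})
```

An extension written this way does not care which engine sits behind the interface, which is why the change is expected to be small for extensions that do not keep state.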
If you want to get a really early look, you can poke around the Parsoid 
repo and its reimplementation of a few extensions [2]. But, note that we 
still have some work to do to (a) clean up the interfaces, (b) untangle 
them further from Parsoid's internals and (c) make sure our design is 
consistent with Tim's proposed work around hooks in general [3] [4]. So, 
what you see in the Parsoid repo today may not be what it will look like 
in the end (in terms of exact interfaces - names, methods, signatures), 
but they will nevertheless operate within the constraints / principles 
1-3 above.
As a long-term goal, we are trying to nudge wikitext (including 
templates, extensions) towards one where the final output is a 
composition of largely independent fragments (no matter who/what 
generated those fragments) with some mostly minor post-processing after 
the document is composed. An updated extension and parser hooks API 
during the switch to Parsoid is one of the first steps. Balanced / Typed 
templates will be the next step in that direction [5].
Hope this helps in planning early. Thanks for asking - it nudged me to 
outline our thinking early even before we have the publishable first 
draft of our updated extension model.
Subbu ( on behalf of the Parsing Team ).
[1] https://blog.wikimedia.org/2018/07/09/tidy-html5-replacement/
[2] https://github.com/wikimedia/parsoid/tree/master/src/Ext
[3] https://lists.wikimedia.org/pipermail/wikitech-l/2019-December/092867.html
[4] https://phabricator.wikimedia.org/T240307
[5] https://phabricator.wikimedia.org/T114445
