Re: [Wikitech-l] Timelines for Parsoid/PHP to replace legacy PHP parser

14 Jan 2020

Thank you so much for the very extensive reply! That's all I hoped to get,
and so much more. Much appreciated!
On Sun, 12 Jan 2020 at 15:37, Subramanya Sastry <ssastry@wikimedia.org>
wrote:
...
On 1/12/20 5:33 PM, Lord_Farin wrote:
...
Hi Wikitech,
I've been catching up on the recent achievements regarding Parsoid/PHP,
well done!
Thanks!
The switchover of wikitext engines is going to take some time. I would
be surprised if we got all the ducks lined up before 18 months from now
-- we have a bunch of work to do still.
Other details below:
...
With WMF sites being migrated, of course non-WMF sites start to creep into
the picture. As I'm involved in running one of those, I'm curious to know
if and how you are going to support this upgrade? I've read about Linter
and ParserMigration but I'm not clear on how they fit into the picture.
We built the Linter and ParserMigration extensions to support the
replacement of HTML4 Tidy with RemexHTML [1]. We anticipate leveraging
those in our efforts to consolidate behind Parsoid (post-unification
work) as the default wikitext engine for MediaWiki. We don't quite know
the specifics yet. My hunch is that this replacement is going to be most
complex for Wikimedia wikis, and I expect most 3rd party wikis to have a
much easier time switching over.
...
I'm asking specifically because we are running some custom extensions which
will probably break with the advent of Parsoid/PHP.
At present we are running MW 1.33 on PHP 7.0, but we are not using VE.
It would be fine if we as a maintenance team have to invest some (or even
considerable) time and effort but I would like to know the size of the
endeavour beforehand...
One of the changes that will take some work is how extensions interact
with the parser (Parsoid in the future). So far, this happens through
access to the Parser object as well as through parser hooks. However, in
the Parsoid regime, this model will change. While the details are yet to
be finalized and we are yet to publish the first draft for review
(likely in the next couple months), here is how we've been thinking
about this:

1. Extensions will no longer have direct access to the parser itself --
   all interaction will be through an API / interface.

2. Hooks are unlikely to be based on timelines of how wikitext passes
   through the parser, i.e. before something happens, or after something
   happens. We are going to move more towards a pure functional model as
   far as possible. So, as far as extension tags are concerned, they get
   access to the tag source, args, and possibly some other information and
   are expected to return output HTML / a DOM fragment (here they will
   leverage the parser API/interface I mention in 1. above). Most
   extensions that implement custom tags already behave in this manner and
   this simply formalizes that.

3. Some extensions set parser state and update it across invocations. We
   currently have no intention of supporting that. We are going to look at
   what the underlying need is that is being modeled through side-effects /
   state and will provide first-class support for that in some manner.
   For example, some (like Cite) use state for enumeration and numbering
   purposes, and this can be done as a post-processing pass on the DOM when
   they get to inspect the "final" DOM. Presumably these global document
   processors are the exception, not the norm. But, statelessness lets us
   process the document in arbitrary order (or even skip processing parts
   of the document by reusing extension/template/media output from previous
   versions of the document), and use the final post-processing step as the
   synchronization step to enforce source-text ordering (like numbering).
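
To make points 2 and 3 concrete, here is a minimal Python sketch of the idea (Python only for brevity; it is not Parsoid code). The names `Fragment`, `ref_tag` and `number_refs` are hypothetical and not part of any published interface. The tag handler is a pure function that leaves a placeholder, and a single post-processing pass assigns numbers in source-text order, regardless of the order in which the fragments were produced.

```python
import html
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Fragment:
    """Output of one tag invocation; carries no shared state."""
    html: str


def ref_tag(source: str, args: dict) -> Fragment:
    # Pure function: the same input always yields the same fragment.
    # The footnote number is left as a "[?]" placeholder for the final pass.
    note = html.escape(source, quote=True)
    return Fragment(f'<sup class="ref" data-note="{note}">[?]</sup>')


def number_refs(document_html: str) -> str:
    """Post-processing pass: number the placeholders in document order."""
    counter = 0

    def replace(match: re.Match) -> str:
        nonlocal counter
        counter += 1
        return f"{match.group(1)}[{counter}]</sup>"

    return re.sub(r'(<sup class="ref"[^>]*>)\[\?\]</sup>', replace, document_html)


sources = ["first note", "second note"]
# Fragments are deliberately produced in reverse order here ...
fragments = {s: ref_tag(s, {}) for s in reversed(sources)}
# ... and composed in source order before the synchronizing final pass.
document = " ".join(fragments[s].html for s in sources)
numbered = number_refs(document)
print(numbered)
```

Because `ref_tag` never touches a counter, the numbering depends only on where each fragment ends up in the composed document.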
We anticipate most extensions are going to need some (hopefully minor)
changes. If your extension doesn't deal with wikitext itself, the
changes are probably going to be relatively minor. But, if your
extension deals with wikitext, then it might need an update in terms of
how it generates its output (using the ParsoidExtensionAPI interface
instead of an actual parser object), but once again, these are unlikely to
be very significant changes. However, if your extension maintains state
across invocations, then it might need some rethink (as stated in 3.
above).
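
As a rough illustration of "an interface instead of an actual parser object", the following Python sketch shows a tag handler that only ever calls a narrow interface. Everything here is an assumption made for the example: `ExtensionAPI`, `wikitext_to_html`, `FakeAPI` and `note_tag` are hypothetical names and do not reflect the actual ParsoidExtensionAPI, whose methods and signatures were not final at the time of this thread.

```python
import html
import re
from typing import Protocol


class ExtensionAPI(Protocol):
    """Hypothetical narrow interface; NOT the real ParsoidExtensionAPI."""

    def wikitext_to_html(self, wikitext: str) -> str: ...


class FakeAPI:
    """Stand-in implementation for this sketch: only handles '''bold'''."""

    def wikitext_to_html(self, wikitext: str) -> str:
        return re.sub(r"'''(.+?)'''", r"<b>\1</b>", wikitext)


def note_tag(api: ExtensionAPI, source: str, args: dict[str, str]) -> str:
    # The handler never sees a parser object; it only calls the interface
    # and returns an HTML fragment built from the tag source and args.
    css_class = html.escape(args.get("class", "note"), quote=True)
    return f'<div class="{css_class}">{api.wikitext_to_html(source)}</div>'


rendered = note_tag(FakeAPI(), "This is '''important'''.", {"class": "warning"})
print(rendered)
```

An extension written this way can be exercised against a stub like `FakeAPI` without a running parser, which is one practical benefit of hiding the parser behind an interface.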
As our extension currently does some hacks to get a correct behaviour when
modifying the ToC, I guess that the new model will be a fair bit easier. As
stated, it will no longer be necessary to work with wikitext and instead a
hook at postprocessing level should be sufficient for that. This makes all
the templating also much easier.
In summary I'm looking forward to the exact implementation and while I
expect considerable work, it will be nowhere near the complexity that has
currently been built; just work migrating from A to B.
...
If you want to get a really early look, you can poke around the Parsoid
repo and its reimplementation of a few extensions [2]. But, note that we
still have some work to do to (a) clean up the interfaces, (b) untangle
them further from Parsoid's internals and (c) make sure our design is
consistent with Tim's proposed work around hooks in general [3] [4]. So,
what you see in the Parsoid repo today may not be what it will look like
in the end (in terms of exact interfaces - names, methods, signatures),
but they will nevertheless operate within the constraints / principles
1-3 above.
As a long-term goal, we are trying to nudge wikitext (including
templates, extensions) towards one where the final output is a
composition of largely independent fragments (no matter who/what
generated those fragments) with some mostly minor post-processing after
the document is composed. An updated extension and parser hooks API
during the switch to Parsoid is one of the first steps. Balanced / Typed
templates will be the next step in that direction [5].
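
The "composition of largely independent fragments" goal can be sketched as follows, again in Python and with purely hypothetical names (`fragment_key`, `render_document`, `shout`). The assumption is that a fragment depends only on its handler name and source text, so output from a previous version of the document can be reused when neither has changed.

```python
import hashlib


def fragment_key(name: str, source: str) -> str:
    """Cache key derived only from the handler name and its source text."""
    return hashlib.sha256(f"{name}\0{source}".encode("utf-8")).hexdigest()


def render_document(parts, handlers, cache):
    """Compose fragments in source order; `parts` is a list of (name, source)."""
    out = []
    for name, source in parts:
        key = fragment_key(name, source)
        if key not in cache:
            # Only fragments whose source changed are recomputed.
            cache[key] = handlers[name](source)
        out.append(cache[key])
    return "".join(out)


calls = []


def shout(source: str) -> str:
    calls.append(source)  # records invocations to show which fragments ran
    return f"<p>{source.upper()}</p>"


cache = {}
handlers = {"shout": shout}
v1 = render_document([("shout", "a"), ("shout", "b")], handlers, cache)
v2 = render_document([("shout", "a"), ("shout", "c")], handlers, cache)
print(v1, v2, calls)
```

In the second render only the changed fragment is recomputed. This kind of reuse is only safe when handlers do not keep state across invocations.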
Hope this helps in planning early. Thanks for asking - it nudged me to
outline our thinking early even before we have the publishable first
draft of our updated extension model.
Subbu ( on behalf of the Parsing Team ).
[1] https://blog.wikimedia.org/2018/07/09/tidy-html5-replacement/
[2] https://github.com/wikimedia/parsoid/tree/master/src/Ext
[3] https://lists.wikimedia.org/pipermail/wikitech-l/2019-December/092867.html
[4] https://phabricator.wikimedia.org/T240307
[5] https://phabricator.wikimedia.org/T114445
Reading some more context (some of which I had been following before) and
seeing it come together into a coherent plan of action is really cool.
Best of luck with all the complexity still ahead, and props for what you
all have achieved so far!
Best,
LF
