On 1/12/20 5:33 PM, Lord_Farin wrote:
Hi Wikitech,
I've been catching up on the recent achievements regarding Parsoid/PHP, well done!
Thanks!
The switchover of wikitext engines is going to take some time. I would be surprised if we got all the ducks lined up before 18 months from now -- we have a bunch of work to do still.
Other details below:
With WMF sites being migrated, of course non-WMF sites start to creep into the picture. As I'm involved in running one of those, I'm curious to know if and how you are going to support this upgrade? I've read about Linter and ParserMigration but I'm not clear on how they fit into the picture.
We built the Linter and ParserMigration extensions to support the replacement of HTML4 Tidy with RemexHTML [1]. We anticipate leveraging those in our efforts to consolidate behind Parsoid (post-unification work) as the default wikitext engine for MediaWiki. We don't quite know the specifics yet. My hunch is that this replacement is going to be most complex for Wikimedia wikis, and I expect most 3rd party wikis to have a much easier time switching over.
I'm asking specifically because we are running some custom extensions which will probably break with the advent of Parsoid/PHP. At present we are running MW 1.33 on PHP 7.0, but we are not using VE. It would be fine if we as a maintenance team have to invest some (or even considerable) time and effort but I would like to know the size of the endeavour beforehand...
One of the changes that will take some work is how extensions interact with the parser (Parsoid in the future). So far, this happens through access to the Parser object as well as through parser hooks. However, in the Parsoid regime, this model will change. While the details are yet to be finalized and we are yet to publish the first draft for review (likely in the next couple months), here is how we've been thinking about this:
1. Extensions will no longer have direct access to the parser itself -- all interaction will be through an API / interface.
2. Hooks are unlikely to be based on timelines of how wikitext passes through the parser, i.e. before something happens, or after something happens. We are going to move more towards a pure functional model as far as possible. So, as far as extension tags are concerned, they get access to the tag source, args, and possibly some other information and are expected to return output HTML / a DOM fragment (here they will leverage the parser API/interface I mention in 1. above). Most extensions that implement custom tags already behave in this manner and this simply formalizes that.
3. Some extensions set parser state and update it across invocations. We currently have no intention of supporting that. We are going to look at what underlying need is being modeled through side-effects / state and will try to provide first-class support for that in some manner. For example, some (like Cite) use state for enumeration and numbering purposes, and this can be done as a post-processing pass on the DOM when they get to inspect the "final" DOM. Presumably these global document processors are the exception, not the norm. But statelessness lets us process the document in arbitrary order (or even skip processing parts of the document by reusing extension/template/media output from previous versions of the document), and use the final post-processing step as the synchronization step to enforce source-text ordering (like numbering).
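To make points 1-3 a bit more concrete, here is a rough sketch in Python. It is not Parsoid code and does not reflect the eventual interface: every name in it (TagCall, render_ref, render_fragments, compose, number_refs) is hypothetical and chosen only for illustration. It assumes a Cite-like tag whose handler is a pure function of its source and args, fragments that can be rendered in any order, and a final pass over the composed document that assigns numbers in source order.

```python
import html
import re
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class TagCall:
    """One extension-tag occurrence in the source (hypothetical structure)."""
    offset: int                      # position of the tag in the source text
    name: str                        # tag name, e.g. "ref"
    source: str                      # text between the opening and closing tag
    args: Tuple[Tuple[str, str], ...] = ()


def render_ref(source: str, args: Dict[str, str]) -> str:
    # Pure function: no counters, no shared parser state.
    # The number is left as a placeholder for the final pass.
    title = html.escape(source, quote=True)
    return '<sup class="ref" title="{}">[?]</sup>'.format(title)


# Hypothetical registry mapping tag names to stateless handlers.
HANDLERS: Dict[str, Callable[[str, Dict[str, str]], str]] = {"ref": render_ref}


def render_fragments(calls: List[TagCall]) -> Dict[int, str]:
    # Rendered in reverse on purpose: with stateless handlers the
    # processing order does not affect the result.
    rendered = {}
    for call in reversed(calls):
        rendered[call.offset] = HANDLERS[call.name](call.source, dict(call.args))
    return rendered


def compose(rendered: Dict[int, str]) -> str:
    # Composition restores source-text order.
    return " ".join(rendered[offset] for offset in sorted(rendered))


def number_refs(document: str) -> str:
    # Post-processing pass: the only step that sees the whole document,
    # so it is the one place where numbering is assigned.
    counter = 0

    def assign(match: "re.Match[str]") -> str:
        nonlocal counter
        counter += 1
        return ">[{}]</sup>".format(counter)

    return re.sub(r">\[\?\]</sup>", assign, document)


calls = [TagCall(0, "ref", "first note"), TagCall(10, "ref", "second note")]
print(number_refs(compose(render_fragments(calls))))
```

Because the handler never touches a counter, a cached fragment from an earlier revision could be dropped into the rendered map unchanged, and the numbering pass would still produce consistent output.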
We anticipate most extensions are going to need some (hopefully minor) changes. If your extension doesn't deal with wikitext itself, the changes are probably going to be relatively minor. But if your extension deals with wikitext, then it might need an update in terms of how it generates its output (using the ParsoidExtensionAPI interface instead of an actual parser object); once again, these are unlikely to be very significant changes. However, if your extension maintains state across invocations, then it might need some rethinking (as stated in 3. above).
If you want to get a really early look, you can poke around the Parsoid repo and its reimplementation of a few extensions [2]. But, note that we still have some work to do to (a) clean up the interfaces, (b) untangle them further from Parsoid's internals and (c) make sure our design is consistent with Tim's proposed work around hooks in general [3] [4]. So, what you see in the Parsoid repo today may not be what it will look like in the end (in terms of exact interfaces - names, methods, signatures), but they will nevertheless operate within the constraints / principles 1-3 above.
As a long-term goal, we are trying to nudge wikitext (including templates, extensions) towards a model where the final output is a composition of largely independent fragments (no matter who/what generated those fragments) with some mostly minor post-processing after the document is composed. An updated extension and parser hooks API during the switch to Parsoid is one of the first steps. Balanced / Typed templates will be the next step in that direction [5].
Hope this helps in planning early. Thanks for asking - it nudged me to outline our thinking early even before we have the publishable first draft of our updated extension model.
Subbu (on behalf of the Parsing Team).
[1] https://blog.wikimedia.org/2018/07/09/tidy-html5-replacement/
[2] https://github.com/wikimedia/parsoid/tree/master/src/Ext
[3] https://lists.wikimedia.org/pipermail/wikitech-l/2019-December/092867.html