[MediaWiki-l] RFC: Parsoid & Extensions

Subramanya Sastry ssastry at wikimedia.org
Mon Sep 14 17:16:49 UTC 2020

[---- Long mail - but only relevant to extension developers ----]


As some of you might know, on the Parsing Team [0], we aim to
replace the core wikitext parser with Parsoid [1] on Wikimedia wikis late
next year and start to put to rest the two-parser ghost that has haunted us
for many years. In recent years, we achieved two major milestones along
the way: replacing HTML4 Tidy with HTML5 Remex [2], and porting Parsoid from
JavaScript to PHP [3].

Given that context, if you (help) maintain an extension that:

* uses a "parser hook" and/or
* uses the "parser API" (i.e. uses public properties / methods in
   Parser.php, ParserOutput.php, ParserOptions.php, etc.)

please read on. If you don't fit that description, you can stop reading.

Parsoid models and processes wikitext quite differently from the
core parser - all that Parsoid guarantees is that the rendering is largely
identical, not the specific process of generating the rendering. This
means that extensions that extend the behavior of the parser will need to
adapt to work with Parsoid in order to provide similar functionality. With
that in mind, we have been working to more clearly specify how extensions
need to adapt to the Parsoid regime.


At a high level, here are the questions we needed to answer, along with
some highly simplified answers:

1. How do extensions "hook" into Parsoid?
A. Extensions need to think in terms of transformations (convert this
    to that) instead of parser pipeline events (at this point in the
    pipeline, call this listener). An additional detail here is that
    extensions cannot maintain global ordered state within extension code
    since Parsoid doesn't guarantee handlers will be invoked in the same
    order in which they showed up in page source. See the wiki [4] for
    more details.

    As for the mechanics of registration, Parsoid uses existing mechanisms
    based on the extension.json file.

2. When the registered hook listeners are invoked by Parsoid, how do they
    process any wikitext they need to process?
A. Parsoid provides all registered listeners with an API object to interact
    with it. Direct use of Parsoid's internal code is strongly discouraged,
    and this restriction will be enforced in various ways, including via
    code review.

3. How is the extension's output assimilated into the page output?
A. The output is treated as a "fully-processed" page/DOM fragment (with
    some caveats which will be clarified on wiki). It is appropriately
    decorated with additional markup, and slotted into place into the page.
    Extensions need not make any special efforts (aka strip state) to
    protect it from the parsing pipeline.

Slides 8-12 of the August 12, 2020 Tech Talk [7] go over the differences.
Check the wiki [4] for more details of Parsoid's Extension API. It also
maps core parser hooks to Parsoid's extension functionality.
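
To make the point about ordered state in answer 1 concrete, here is a
minimal sketch in Python. It is not Parsoid code, and every name in it
(CounterHandler, transform, number_in_document_order) is hypothetical and
exists only for illustration. It assumes a footnote-like extension whose
handlers happen to be invoked in a different order from page source, and
contrasts a handler that numbers items as it is called with one that
returns plain data and leaves numbering to a later pass in document order.

```python
# Hypothetical names throughout; none of this is Parsoid's actual API.


class CounterHandler:
    """Event-style handler: numbers footnotes in invocation order."""

    def __init__(self):
        self.count = 0  # global ordered state kept in extension code

    def handle(self, text):
        self.count += 1
        return f"[{self.count}] {text}"


def transform(text):
    """Transformation-style handler: output depends only on its input."""
    return {"kind": "footnote", "text": text, "number": None}


def number_in_document_order(fragments):
    """Post-processing pass over the assembled page, in document order."""
    for n, frag in enumerate(fragments, start=1):
        frag["number"] = n
    return [f"[{f['number']}] {f['text']}" for f in fragments]


source_order = ["first", "second", "third"]
call_order = [2, 0, 1]  # assumed invocation order, unlike page order

# Event style: numbering follows invocation order, not page order.
stateful = CounterHandler()
slots = [None] * len(source_order)
for i in call_order:
    slots[i] = stateful.handle(source_order[i])

# Transformation style: same call order, numbering assigned afterwards.
frags = [None] * len(source_order)
for i in call_order:
    frags[i] = transform(source_order[i])
stateless = number_in_document_order(frags)

print(slots)      # numbering scrambled relative to page order
print(stateless)  # numbering matches page order
```

The second style keeps each handler free of shared counters, so the result
does not depend on when a handler happens to run.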


We consider the current proposal to be in late draft stage. That said, as
we discover unsupported functionality, we will augment the set of hooks and
the Parsoid Extension API as needed.

While there are a wide variety of extensions in the MediaWiki universe
with varied use cases, our initial goal for the next year is just Wikimedia
wikis and hence extensions that are deployed on the Wikimedia wikis.
Once we are done with that, we will turn our attention to supporting
extension use cases in the wider MediaWiki universe. Still, now is a
good time for all extension developers to study and review this API
and give us feedback.

Since the beginning of this year, we've refactored all of the extensions
we've written Parsoid versions of (Cite, Gallery, Poem, Pre, JSON) to
now strictly use the Parsoid Extension API without cheating by virtue
of being in the Parsoid codebase. So, this proposal is actually backed
by an implementation that is in production for Wikimedia wikis.


Here is where you come in.

* If you maintain / develop an extension, please review the document
   to see if your extension's use case is covered.

   Ideally, leave your feedback on the Parsoid Extension API talk page [5]
   since it helps keep it all in one place. Alternatively, you can also
   leave questions / concerns / other feedback on the Phabricator task
   we've filed for TechCom's RFC process [6].

* If you feel bold, start the process of updating your extensions *now*.
   Note that your extension will need to operate with both the existing
   core parser and Parsoid until we deprecate and stop using the core
   parser.

   There are known functionality gaps related to exposing the ParserOutput
   object and providing setFunctionHook functionality. If your extension
   needs those, you should probably wait for us to fill those gaps.
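
One way to keep an extension working under both parsers during the
transition is to put the rendering logic in a parser-independent function
and wrap it in two thin adapters. The Python sketch below only illustrates
that shape. All names (render_spoiler, legacy_parser_hook,
parsoid_style_handler, FakeApi, html_to_fragment) are hypothetical, and
neither adapter mirrors a real MediaWiki or Parsoid signature.

```python
import html

# Hypothetical names; these do not mirror real MediaWiki or Parsoid signatures.


def render_spoiler(body, attrs):
    """Parser-independent core: plain inputs in, HTML string out."""
    title = html.escape(attrs.get("title", "Spoiler"))
    return f"<details><summary>{title}</summary>{html.escape(body)}</details>"


def legacy_parser_hook(body, attrs, parser=None, frame=None):
    """Adapter in the style of a core-parser tag hook callback."""
    return render_spoiler(body or "", attrs or {})


def parsoid_style_handler(api, body, attrs):
    """Adapter in the style of a handler that is given an API object.

    'api' stands in for the object the handler receives; html_to_fragment
    is an assumed method that turns markup into a page fragment.
    """
    return api.html_to_fragment(render_spoiler(body, attrs))


class FakeApi:
    """Stand-in API object so the sketch can run on its own."""

    def html_to_fragment(self, markup):
        return {"fragment": markup}


legacy_html = legacy_parser_hook("a<b", {"title": "T"})
new_fragment = parsoid_style_handler(FakeApi(), "a<b", {"title": "T"})
```

With this split, only the adapters need to change as the hooks and the
extension API evolve; the shared core stays the same for both parsers.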


If you need more information or help:

* Check the wiki page [4] for docs and discuss on the talk page [5]
* Check the August 12, 2020 Tech Talk [7]
* Look at Parsoid code for extensions [8]
* Look at Parsoid docs for the Ext/ namespace [9]
* Talk to us on IRC in the #mediawiki-parsoid channel
* Email us at parsing-team at wikimedia.org

Subbu (on behalf of the Parsing Team).


0. https://www.mediawiki.org/wiki/Parsing
1. https://www.mediawiki.org/wiki/Parsing/Parser_Unification
2. https://blog.wikimedia.org/2018/07/09/tidy-html5-replacement/

4. https://www.mediawiki.org/wiki/Parsoid/Extension_API
5. https://www.mediawiki.org/wiki/Parsoid/Talk:Extension_API
6. https://phabricator.wikimedia.org/T260714
7. Slides: 

    Video: https://www.youtube.com/watch?v=lS1xPkERWCM
8. https://github.com/wikimedia/parsoid/tree/master/src/Ext
9. https://doc.wikimedia.org/Parsoid-PHP/master/
