Some quick comments.
As has already been alluded to, Parsoid does a couple of different things.
* It converts wikitext to HTML (in such a way that edits to the HTML can be serialized back to wikitext without introducing dirty diffs in the wikitext).
* It converts HTML to wikitext (in such a way that edits to the wikitext preserve HTML semantics). There are caveats here in that Parsoid doesn't yet handle arbitrary HTML that you might throw at it, but insofar as the HTML conforms to the DOM spec [1], Parsoid should do a good job of serializing it to wikitext.
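To make the two directions concrete, here is a minimal sketch of how a client might address a transform service for each direction. The base URL and endpoint layout below are assumptions for illustration (modeled on the RESTBase-style transform paths), not a definitive description of Parsoid's API:

```python
from urllib.parse import quote

# Hypothetical base URL; the endpoint layout here is an assumption
# for illustration, modeled on RESTBase-style transform paths.
BASE = "https://en.wikipedia.org/api/rest_v1"

def wt2html_url(title):
    """URL for converting a page's wikitext to HTML."""
    return f"{BASE}/transform/wikitext/to/html/{quote(title, safe='')}"

def html2wt_url(title):
    """URL for serializing (possibly edited) HTML back to wikitext."""
    return f"{BASE}/transform/html/to/wikitext/{quote(title, safe='')}"

print(wt2html_url("Main Page"))
```

A client like VisualEditor only ever touches the HTML side, relying on the second direction to get clean wikitext back out.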
This bidirectionality means that Parsoid can support clients that don't need to deal with wikitext directly, knowing that Parsoid can go both ways. Amir has mentioned Content Translation. See the full list of clients here [3].
This support for bidirectional conversion between wikitext and HTML is non-trivial. See the "Lossless conversion" section and other details in [2]. Getting Parsoid to this stage in terms of rendering and bidirectionality has required us to work through a lot of issues and edge cases, given that editing requires HTML semantics while wikitext and transclusion are string-based. Parsoid can map a DOM node to the substring of wikitext that generated it, which is also a non-trivial achievement. See the tech talk here [4]. I'm skipping the details of the different levels of testing that we implement to achieve this, but that has been a substantial part of getting to this point and being able to deploy seamlessly on a regular basis [5], largely without incident.
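The node-to-substring mapping works by annotating each DOM node with source offsets into the original wikitext (the "dsr" field described in the DOM spec [1]). A toy sketch of how such offsets let you recover a node's source span; the tuple layout and example values here are illustrative assumptions, not Parsoid's exact internal representation:

```python
def dsr_slices(wikitext, dsr):
    """Given a DSR-style tuple (start, end, open_width, close_width)
    of offsets into the source wikitext, return the full source span
    of a DOM node and its inner content span."""
    start, end, open_w, close_w = dsr
    full = wikitext[start:end]
    inner = wikitext[start + open_w:end - close_w]
    return full, inner

wt = "'''bold''' and plain"
# Toy offsets for the <b> node produced by '''bold''': the node spans
# source offsets 0-10, with 3-character open/close markup (''').
full, inner = dsr_slices(wt, (0, 10, 3, 3))
print(full, inner)  # '''bold''' bold
```

Keeping these offsets accurate through the whole parsing pipeline is what makes selective serialization (only re-serializing the parts of the DOM an editor actually touched) possible.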
As for the other part about preprocessing, yes, Parsoid currently relies on the MediaWiki API.
The core parser has the following components:
* preprocessing that expands transclusions, extensions (including Scribunto), parser functions, include directives, etc. to wikitext
* a wikitext parser that converts wikitext to HTML
* Tidy, which runs on the HTML produced by the wikitext parser and fixes up malformed HTML
Parsoid right now replaces the last two of those three components, but in a way that enables all of the functionality stated earlier. I'll skip the historical and technical reasons why we haven't yet put energy and resources into the preprocessing component of Parsoid, but in brief, we found it more important to enable the bidirectional functionality and support clients, reusing the preprocessing functionality via the MediaWiki API.
But there are several directions this can go from here (including implementing a preprocessor in Parsoid, for example). However, note that this discussion is not entirely about Parsoid but also about shared hosting support, MediaWiki packaging, a pure-PHP MediaWiki install, HTML-only wikis, etc. All those other decisions inform what Parsoid should focus on and how it evolves.
Subbu.
[1] http://www.mediawiki.org/wiki/Parsoid/MediaWiki_DOM_spec
[2] https://blog.wikimedia.org/2013/03/04/parsoid-how-wikipedia-catches-up-with-...
[3] http://www.mediawiki.org/wiki/Parsoid/Users
[4] https://www.youtube.com/watch?v=Eb5Ri0xqEzk with slides @ https://commons.wikimedia.org/wiki/File:Parsoid.techtalk.apr15.2014.pdf
[5] https://www.mediawiki.org/wiki/Parsoid/Deployments
On 01/20/2015 08:02 AM, C. Scott Ananian wrote:
I believe Subbu will follow up with a more complete response, but I'll note that:
- no plan survives first encounter with the enemy. Parsoid was going to
be simpler than the PHP parser, Parsoid was going to be written in PHP, then C, then prototyped in JS for a later implementation in C, etc. It has varied over time as we learned more about the problem. It is currently written in node.js and is probably at least the same order of complexity as the existing PHP parser. It is, however, built on slightly more solid foundations, so its behavior is more regular than the PHP parser in many places -- although I've been submitting patches to the core parser where necessary to try to bring them closer together. (cf. https://gerrit.wikimedia.org/r/180982 for the most recent of these.) And, of course, Parsoid emits well-formed HTML which can be round-tripped.
In many cases Parsoid could be greatly simplified if we didn't have to maintain compatibility with various strange corner cases in the PHP parser.
- Parsoid contains a partial implementation of the PHP expandtemplates
module. It was decided (I think wisely) that we didn't really gain anything by trying to reimplement this on the Parsoid side, and that it was better to use the existing PHP code via api.php. The alternative would be to reimplement quite a lot of MediaWiki (Lua embedding, the various parser function extensions, etc.) in node.js. This *could* be done -- there is no technical reason why it cannot -- but nobody thinks it's a good idea to spend time on right now.
But the expandtemplates stuff basically works. As I said, it doesn't contain all the crazy extensions that we use on the main WMF sites, but it would be reasonable to turn it on for a smaller stock mediawiki instance. In that sense it *could* be a full replacement for the Parser.
But note that even as a full parser replacement Parsoid depends on the PHP API in a large number of ways: imageinfo, siteinfo, language information, localized keywords for images, etc. The idea of "independence" is somewhat vague. --scott
On Mon, Jan 19, 2015 at 11:58 PM, MZMcBride z@mzmcbride.com wrote:
Matthew Flaschen wrote:
On 01/19/2015 08:15 AM, MZMcBride wrote:
And from this question flows another: why is Parsoid calling MediaWiki's api.php so regularly?
I think it uses it for some aspects of templates and hooks. I'm sure the Parsoid team could explain further.
I've been discussing Parsoid a bit and there's apparently an important distinction between the preprocessor(s) and the parser. Though in practice I think "parser" is used pretty generically. Further notes follow.
I'm told that in Parsoid, <ref> and {{!}} are special-cased, while most other parser functions require using the expandtemplates module of MediaWiki's api.php. As I understand it, calling out to api.php is intended to be a permanent solution (I thought it might be a temporary shim).
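For reference, a call to the expandtemplates module is an ordinary api.php query. A minimal sketch of building one (just the query string, no network call; the `{{echo|hi}}` template is a made-up example, and `action`, `text`, and `format` are standard expandtemplates parameters):

```python
from urllib.parse import urlencode

def expandtemplates_query(text):
    """Build the query string for MediaWiki's expandtemplates API
    module, which Parsoid calls to expand transclusions and parser
    functions it doesn't handle itself."""
    params = {
        "action": "expandtemplates",
        "text": text,            # wikitext to expand
        "format": "json",
    }
    return urlencode(params)

q = expandtemplates_query("{{echo|hi}}")
print(q)  # action=expandtemplates&text=%7B%7Becho%7Chi%7D%7D&format=json
```

Parsoid sends requests of this shape to the wiki's api.php and splices the expanded wikitext back into its own pipeline.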
If the goal was just to add more verbose markup to parser output, couldn't we have done that in PHP? From what I now understand, Node.js was chosen over PHP due to speed/performance concerns.
The view that Parsoid is going to replace the PHP parser seems to be overly simplistic and goes back to the distinction between the parser and preprocessor. Full wikitext transformation seems to require a preprocessor.
MZMcBride
Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l