(Combining pieces of Jay's thread and pieces of the shared hosting thread.)
Daniel Friesen wrote:
Parsoid can do Parsoid DOM to WikiText conversions. So I believe the suggestion is that storage be switched entirely to the Parsoid DOM and WikiText in classic editing just becomes a method of editing the content that is stored as Parsoid DOM in the backend.
Tim Starling wrote:
Parsoid depends on the MediaWiki parser; it calls it via api.php. It's not a complete, standalone implementation of wikitext-to-HTML transformation.
HTML storage would be a pretty simple feature, and would allow third-party users to use VE without Parsoid. It's not so simple to use Parsoid without the MediaWiki parser, especially if you want to support all existing extensions.
So, as currently proposed, HTML storage is actually a way to reduce the dependency on services for non-WMF wikis, not to increase it.
Based on recent comments from Gabriel and Subbu, my understanding is that there are no plans to drop the MediaWiki parser at the moment.
Yeah... what is this all about? My understanding (and please correct me if I'm wrong) is that Parsoid is/was intended to be a standalone service capable of translating wikitext <--> HTML. You seem to be stating that Parsoid is neither complete nor standalone. Why?
Currently Parsoid is the largest client of the MediaWiki PHP parser, I'm told. If Parsoid is regularly calling and relying upon the MediaWiki PHP parser, what exactly is the point of Parsoid?
How much parity is there between Parsoid without the use of the MediaWiki parser and the MediaWiki parser? That is, if you selected a random sample of pages from a Wikimedia wiki, how many of them could Parsoid correctly parse on its own? And from this question flows another: why is Parsoid calling MediaWiki's api.php so regularly?
I'm also interested in Parsoid's development as it relates to the broader push for services. If Parsoid is going to be the model of future services development, I'd like a clearer evaluation of what kind of model it is.
Again, please correct me if I'm wrong, mistaken, misinformed, etc., but from my place of limited knowledge, it sounds very unappealing to create large Node.js applications ("services") that closely tie in and require(!) PHP counterparts. This seems like the opposite of moving toward a more flexible, modular architecture. From my perspective, it would seem to only saddle us with additional technical debt moving forward, as we double complexity indefinitely.
MZMcBride
If I might weigh in, I concur with MZMcBride. If Parsoid is absolutely needed regardless, that's one thing. But if a VE editing interface can be set up that doesn't need Parsoid, that would reduce dependence on third-party software, make installation easier for all parties concerned, and be less resource-intensive. Since the optimization of resources is always a plus IMO, severing the dependency on Parsoid and attempting to do its current functions purely in house seems like a good plan to pursue.
Date: Mon, 19 Jan 2015 11:15:54 -0500 From: z@mzmcbride.com To: wikitech-l@lists.wikimedia.org Subject: [Wikitech-l] Parsoid's progress
Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
On 01/19/2015 08:15 AM, MZMcBride wrote:
Currently Parsoid is the largest client of the MediaWiki PHP parser, I'm told. If Parsoid is regularly calling and relying upon the MediaWiki PHP parser, what exactly is the point of Parsoid?
Parsoid can go:
wikitext => HTML => wikitext
The MediaWiki parser can only go:
wikitext => HTML
The most important part of Parsoid is thus the HTML => wikitext conversion (required for VisualEditor), but other parts of their architecture follow from that.
And from this question flows another: why is Parsoid calling MediaWiki's api.php so regularly?
I think it uses it for some aspects of templates and hooks. I'm sure the Parsoid team could explain further.
Matt Flaschen
Exactly: The HTML to wikitext conversion is what makes Parsoid useful, and not only for VE.
Thanks to Parsoid, ContentTranslation has a simple rich text editor with contenteditable (not a full VE, though this may change in the future). We are just starting to deploy it to production, but the users who tested in beta labs loved it.
The Parsoid way of converting wikitext to HTML is useful, too, because it allows ContentTranslation to process the article being translated in a formal and expected way, understanding where the links, images, templates, timelines, references, etc. are, and adapting it automatically to the translated article. All of this is done with simple jQuery selectors and very little effort.

On 20 Jan 2015 04:01, "Matthew Flaschen" mflaschen@wikimedia.org wrote:
Matthew Flaschen wrote:
On 01/19/2015 08:15 AM, MZMcBride wrote:
And from this question flows another: why is Parsoid calling MediaWiki's api.php so regularly?
I think it uses it for some aspects of templates and hooks. I'm sure the Parsoid team could explain further.
I've been discussing Parsoid a bit and there's apparently an important distinction between the preprocessor(s) and the parser. Though in practice I think "parser" is used pretty generically. Further notes follow.
I'm told in Parsoid, <ref> and {{!}} are special-cased, while most other parser functions require using the expandtemplates module of MediaWiki's api.php. As I understand it, calling out to api.php is intended to be a permanent solution (I thought it might be a temporary shim).
If the goal was just to add more verbose markup to parser output, couldn't we have done that in PHP? Node.js was chosen over PHP due to speed/performance concerns, from what I now understand.
The view that Parsoid is going to replace the PHP parser seems to be overly simplistic and goes back to the distinction between the parser and preprocessor. Full wikitext transformation seems to require a preprocessor.
MZMcBride
I believe Subbu will follow up with a more complete response, but I'll note that:
1) no plan survives first encounter with the enemy. Parsoid was going to be simpler than the PHP parser, Parsoid was going to be written in PHP, then C, then prototyped in JS for a later implementation in C, etc. It has varied over time as we learned more about the problem. It is currently written in node.js and probably is at least the same order of complexity as the existing PHP parser. It is, however, built on slightly more solid foundations, so its behavior is more regular than the PHP parser in many places -- although I've been submitting patches to the core parser where necessary to try to bring them closer together. (cf. https://gerrit.wikimedia.org/r/180982 for the most recent of these.) And, of course, Parsoid emits well-formed HTML which can be round-tripped.
In many cases Parsoid could be greatly simplified if we didn't have to maintain compatibility with various strange corner cases in the PHP parser.
2) Parsoid contains a partial implementation of the PHP expandtemplates module. It was decided (I think wisely) that we didn't really gain anything by trying to reimplement this on the Parsoid side, though, and it was better to use the existing PHP code via api.php. The alternative would be to basically reimplement quite a lot of mediawiki (lua embedding, the various parser functions extensions, etc) in node.js. This *could* be done -- there is no technical reason why it cannot -- but nobody thinks it's a good idea to spend time on right now.
But the expandtemplates stuff basically works. As I said, it doesn't contain all the crazy extensions that we use on the main WMF sites, but it would be reasonable to turn it on for a smaller stock mediawiki instance. In that sense it *could* be a full replacement for the Parser.
But note that even as a full parser replacement Parsoid depends on the PHP API in a large number of ways: imageinfo, siteinfo, language information, localized keywords for images, etc. The idea of "independence" is somewhat vague. --scott
Given what I've seen so far, it might be best to aim for a gradual reimplementation of Parsoid's features so that most of them work without Parsoid, with the eventual goal of severing the dependency completely if possible. At any rate, the less the parser has to outsource, the less complicated things will be, correct?
Date: Tue, 20 Jan 2015 11:02:10 -0500 From: cananian@wikimedia.org To: wikitech-l@lists.wikimedia.org Subject: Re: [Wikitech-l] Parsoid's progress
Some quick comments.
As has already been alluded to, Parsoid does a couple different things.
* It converts wikitext to HTML (in such a way that edits to the HTML can be serialized back to wikitext without introducing dirty diffs in the wikitext).
* It converts HTML to wikitext (in such a way that edits to the wikitext preserve the HTML semantics).

There are caveats here in that Parsoid doesn't yet handle arbitrary HTML that you might throw at it, but insofar as the HTML conforms to the DOM spec [1], Parsoid should do a good job of serializing it to wikitext.
This bidirectionality means that Parsoid can support clients that don't need to deal with wikitext directly, since Parsoid can go both ways. Amir has mentioned Content Translation. See the full list of clients here [3].
This support for bidirectional conversion between wikitext and HTML is non-trivial. See the "Lossless conversion" section and other details in [2]. Getting Parsoid to where it is in terms of rendering and bidirectionality has required us to work through a lot of issues and edge cases, given that editing requires HTML semantics while wikitext and transclusions are string-based. Parsoid can map a DOM node to the substring of wikitext that generated it, and that is also a non-trivial achievement. See the tech talk here [4]. I'm skipping the details of the different levels of testing that we implement to achieve this, but that has been a substantial part of getting to this point and being able to deploy seamlessly on a regular basis [5], largely without incident.
As for the other part, about preprocessing: yes, Parsoid currently relies on the MediaWiki API.
The core parser has the following components:
* preprocessing that expands transclusions, extensions (including Scribunto), parser functions, include directives, etc. to wikitext
* a wikitext parser that converts wikitext to HTML
* Tidy, which runs on the HTML produced by the wikitext parser and fixes up malformed HTML
Parsoid right now replaces the last two of the three components, but in a way that enables all of the functionality stated earlier. I'll skip, for now, the historical and technical reasons why we haven't put energy and resources into the preprocessing component of Parsoid; in brief, we found it more important to enable the bidirectional functionality, support clients, and reuse the existing preprocessing functionality via the MediaWiki API.
But there are several directions this can go from here (including implementing a preprocessor in Parsoid, for example). However, note that this discussion is not entirely about Parsoid, but also about shared hosting support, MediaWiki packaging, a pure-PHP MediaWiki install, HTML-only wikis, etc. All those other decisions inform what Parsoid should focus on and how it evolves.
Subbu.
[1] http://www.mediawiki.org/wiki/Parsoid/MediaWiki_DOM_spec [2] https://blog.wikimedia.org/2013/03/04/parsoid-how-wikipedia-catches-up-with-... [3] http://www.mediawiki.org/wiki/Parsoid/Users [4] https://www.youtube.com/watch?v=Eb5Ri0xqEzk with slides @ https://commons.wikimedia.org/wiki/File:Parsoid.techtalk.apr15.2014.pdf [5] https://www.mediawiki.org/wiki/Parsoid/Deployments
Thank you both for the detailed replies. They were very helpful and I feel like I have a better understanding now. I'm still trying to wrap my head around Parsoid, its implementation, and how it fits in with the larger future of MediaWiki development.
Subramanya Sastry wrote:
The core parser has the following components:
- preprocessing that expands transclusions, extensions (including Scribunto), parser functions, include directives, etc. to wikitext
- wikitext parser that converts wikitext to html
- Tidy that runs on the html produced by wikitext parser and fixes up malformed html
Parsoid right now replaces the last two of the three components, but in a way that enables all of the functionality stated earlier.
Are you saying Parsoid can act as a replacement for HTMLTidy? That seems like a pretty huge win. Replacing Tidy has been a longstanding goal: https://phabricator.wikimedia.org/T4542.
But, there are several directions this can go from here (including implementing a preprocessor in Parsoid, for example). However, note that this discussion is not entirely about Parsoid but also about shared hosting support, mediawiki packaging, pure PHP mediawiki install, HTML-only wikis, etc. All those other decisions inform what Parsoid should focus on and how it evolves.
I think this is very well put. There's definitely a lot to think about.
MZMcBride
C. Scott Ananian wrote:
- no plan survives first encounter with the enemy. Parsoid was going to be simpler than the PHP parser, Parsoid was going to be written in PHP, then C, then prototyped in JS for a later implementation in C, etc. It has varied over time as we learned more about the problem. It is currently written in node.js and probably is at least the same order of complexity as the existing PHP parser.
Hrm.
In many cases Parsoid could be greatly simplified if we didn't have to maintain compatibility with various strange corner cases in the PHP parser.
I guess this is the part that I'm still struggling with. If the PHP parser is/was already doing the job of converting wikitext to HTML, why would that need to be rewritten in Node.js? Wouldn't it have been simpler to make the HTML output more verbose in the PHP parser so that it could cleanly round-trip? I'm still not clear where Node.js (or C or JavaScript) came into this. I heard there were performance concerns with the PHP parser. Was that the case?
I'm mostly just curious... you can't un-milk the cow, as they say.
But note that even as a full parser replacement Parsoid depends on the PHP API in a large number of ways: imageinfo, siteinfo, language information, localized keywords for images, etc. The idea of "independence" is somewhat vague.
Hrm.
MZMcBride