Hello everyone,
On behalf of the parsing team, here is an update about Parsoid, the bidirectional wikitext <-> HTML parser that supports Visual Editor, Flow, and Content Translation.
Subbu.
-----------------------------------------------------------------------
TL;DR:
1. Parsoid[1] roundtrips 99.95% of the 158K pages in round-trip testing without introducing semantic diffs[2].
2. With trivial simulated edits, the HTML -> wikitext serializer used in production (selective serialization) introduces ZERO dirty diffs in 99.986% of those edits[3]. 10 of those 23 edits with dirty diffs are minor newline diffs.
-----------------------------------------------------------------------
A couple of days back (June 23rd), Parsoid achieved 99.95%[2] semantic accuracy in the wikitext -> HTML -> wikitext roundtripping process on a set of about 158K pages randomly picked from about 16 wikis back in 2013. Keeping this test set constant has let us monitor our progress over time. We were at 99.75% around this time last year.
What does this mean?
--------------------
* Despite the practical complexities of wikitext, the mismatch in the processing models of wikitext (string-based) and Parsoid (DOM-based), and the various wikitext "errors" that are found on pages, Parsoid is able to maintain a reversible mapping between wikitext constructs and their equivalent HTML DOM trees that HTML editors and other tools can manipulate.
The majority of differences in the remaining 0.05% arise from wikitext errors: links nested in links, 'fosterable'[4] content in tables, and some scenarios with unmatched quotes in attributes. Parsoid does not support round-tripping (RT) of these.
* While this is not a big change from where Parsoid's editing support has been for about a year now, it is a notable milestone for us in terms of the confidence we have in Parsoid's ability to handle the wikitext usage seen on production wikis and to RT it accurately without corrupting pages. This should also boost the confidence of all applications that rely on Parsoid.
* In production, Parsoid uses a selective serialization strategy that preserves the wikitext of unedited parts of the page as far as possible.
As part of regular testing, we also simulate a trivial edit by adding a new comment to the page and run the edited HTML through this selective serializer. All but 23 pages (0.014% of trivial edits) had ZERO dirty diffs[3]. Of these 23, 10 of the diffs were minor newline diffs.
In production, the dirty diff rate will be higher than 0.014% because of more complex edits and because of bugs in any of the 3 components involved in visual editing on Wikipedias (Parsoid, RESTBase[5], and Visual Editor) and their interaction. But the base accuracy of Parsoid's roundtripping (both full and selective serialization) is critical to ensuring clean visual edits. The above milestones are part of ensuring that.
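For readers unfamiliar with the test setup, one round-trip test has roughly the following shape (a sketch only; every helper below is a hypothetical stand-in, not Parsoid's actual API):

// Sketch of one round-trip test; the declared functions are hypothetical
// stand-ins for the corresponding pieces of the real test infrastructure.
declare function parseToHTML(wikitext: string): Document;                   // wikitext -> DOM
declare function serializeToWikitext(doc: Document): string;                // DOM -> wikitext (full serialization)
declare function selectiveSerialize(doc: Document, origWt: string): string; // reuse original wikitext for unedited parts
declare function semanticDiffs(a: string, b: string): string[];             // classify diffs, ignoring insignificant ones

function roundTripTest(origWt: string) {
  const doc = parseToHTML(origWt);

  // 1. Full round-trip: wikitext -> HTML -> wikitext; any semantic diff is a failure.
  const fullRtDiffs = semanticDiffs(origWt, serializeToWikitext(doc));

  // 2. Simulated trivial edit: add a comment, then selectively serialize.
  //    Apart from the added comment, the result should match the original byte-for-byte.
  doc.body.appendChild(doc.createComment("rt-test"));
  const edited = selectiveSerialize(doc, origWt);
  const dirtyDiff = edited.replace("<!--rt-test-->", "") !== origWt;

  return { semanticDiffCount: fullRtDiffs.length, dirtyDiff };
}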
What does this not mean?
------------------------
* If you edit one of those 0.05% of pages in VE, the VE-Parsoid combination will break the page. NO!
If you edit the broken part of the page, Parsoid will very likely normalize the broken wikitext to the non-erroneous form (break up nested links, move fostered content out of the table, drop duplicate transclusion parameters, etc.). In the odd case, it could cause a dirty diff that changes the semantics of those broken constructs.
* Parsoid's visual rendering is 99.95% identical to PHP parser rendering. NO!
RT tests are focused on Parsoid's ability to support editing without introducing dirty diffs. Even though Parsoid might render a page differently than the default read view (and might even be incorrect), we are nevertheless able to RT it without breaking the wikitext.
On the way to getting to 99.95% RT accuracy, we have improved Parsoid's rendering and fixed several bugs in it. The rendering is also fairly close to the default read view (otherwise, VE editors would definitely complain). However, we haven't done sufficient testing to systematically identify rendering incompatibilities and quantify this. In the coming quarters, we are going to turn our attention to this problem. We have a visual diffing infrastructure to help us with this (we take screenshots of Parsoid's output and the default output, compare the images, and flag diffs); a rough sketch of that comparison idea follows at the end of this section. We'll have to tweak and fix our visual-diffing setup and then fix the rendering problems we find.
* 100% roundtripping accuracy is within reach. NO!
The reality is that there are a lot of pages in production with various kinds of broken markup (mis-nested HTML tags, unmatched HTML tags, broken templates). There are probably other edge-case scenarios that trigger different behavior in Parsoid and the PHP parser. Because we go to great lengths in Parsoid to avoid dirty diffs, our selective serialization works quite well. There have been very few reports of page corruption over the last year, and where they have surfaced, we've usually moved pretty quickly to fix them; we'll continue to do so.
In addition, our diff classification algo will never be perfect and there will always be false positives. Overall, we may crawl further along by 0.01% or 0.02%, but we are not holding our breath and neither should you.
* If we pick a new corpus of 100K pages, we'll have similar accuracy. MAYBE!
Because we've tested against a random sample of pages across multiple Wikipedias, we expect that we've encountered the vast majority of scenarios that Parsoid will encounter in production. So, we have a very high degree of confidence that our fixes are not tailored to our test pages.
As part of https://phabricator.wikimedia.org/T101928 we will be doing a refresh of our test set, focusing more on enwp pages, non-Wikipedia test pages, and probably introducing a set of high traffic pages.
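To make the visual-diffing idea mentioned above concrete, here is a rough sketch of the pixel comparison (illustrative only; pngjs, pixelmatch, and the screenshot helper are assumptions here, not our actual tooling):

// Sketch of pixel-level visual diffing between two renderings of a page.
// The screenshot() helper is hypothetical; pngjs and pixelmatch are just
// example libraries, not necessarily what our infrastructure uses.
import * as fs from "node:fs";
import { PNG } from "pngjs";
import pixelmatch from "pixelmatch";

declare function screenshot(url: string, outFile: string): Promise<void>; // hypothetical

async function countDifferingPixels(phpUrl: string, parsoidUrl: string): Promise<number> {
  await screenshot(phpUrl, "php.png");         // default (PHP parser) read view
  await screenshot(parsoidUrl, "parsoid.png"); // Parsoid rendering of the same revision

  const a = PNG.sync.read(fs.readFileSync("php.png"));
  const b = PNG.sync.read(fs.readFileSync("parsoid.png"));
  const diff = new PNG({ width: a.width, height: a.height });

  // Assumes both screenshots have the same dimensions; returns the mismatched pixel count.
  return pixelmatch(a.data, b.data, diff.data, a.width, a.height, { threshold: 0.1 });
}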
Next steps
----------
Given where we are now, we can start thinking about the next level with a bit more focus and energy. Our next steps are to bring the PHP parser and Parsoid closer, both in terms of output and long-term capabilities.
Some possibilities:
* Replace Tidy ( https://phabricator.wikimedia.org/T89331 )
* Pare down rendering differences between the two systems so that we can start thinking about using Parsoid HTML instead of MWParser HTML for read views. ( https://phabricator.wikimedia.org/T55784 )
* Use Parsoid as a WikiLint tool ( https://phabricator.wikimedia.org/T48705 , https://www.mediawiki.org/wiki/Parsoid/Linting/GSoC_2014_Application )
* Support improved templating abilities (data-driven tables, etc.)
* Improve Parsoid's parsing performance.
* Implement stable ids to be able to attach long-lived metadata to the DOM and track it across edits.
* Move wikitext to a DOM-based processing model, using Parsoid as a bridge. This could make several useful things possible, e.g. much better automatic edit conflict resolution.
* Long-term: Make Parsoid redundant in its current complex avatar.
References
----------
[1] https://www.mediawiki.org/wiki/Parsoid -- bidirectional parser supporting visual editing
[2] http://parsoid-tests.wikimedia.org/failsDistr ; http://parsoid-tests.wikimedia.org/topfails shows the actual failures
[3] http://parsoid-tests.wikimedia.org/rtselsererrors/aa5804ca89dc644f744af24c47...
[4] http://dev.w3.org/html5/spec-LC/tree-construction.html#foster-parenting
[5] https://www.mediawiki.org/wiki/RESTBase#Use_cases
On 2015-06-25 at 17:22 (-0500), Subramanya Sastry wrote:
TL:DR;
- Parsoid[1] roundtrips 99.95% of the 158K pages in round-trip testing without introducing semantic diffs[2].
- With trivial simulated edits, the HTML -> wikitext serializer used in production (selective serialization) introduces ZERO dirty diffs in 99.986% of those edits[3]. 10 of those 23 edits with dirty diffs are minor newline diffs.
Huge congrats, Subbu and team!
On 25 June 2015 at 23:22, Subramanya Sastry ssastry@wikimedia.org wrote:
On behalf of the parsing team, here is an update about Parsoid, the bidirectional wikitext <-> HTML parser that supports Visual Editor, Flow, and Content Translation.
eeeexcellent. How close are we to binning the PHP parser? (I realise that's a way off, but grant me my dreams.)
- d.
On 06/25/2015 06:29 PM, David Gerard wrote:
On 25 June 2015 at 23:22, Subramanya Sastry ssastry@wikimedia.org wrote:
On behalf of the parsing team, here is an update about Parsoid, the bidirectional wikitext <-> HTML parser that supports Visual Editor, Flow, and Content Translation.
eeeexcellent. How close are we to binning the PHP parser? (I realise that's a way off, but grant me my dreams.)
The "PHP parser" used in production has 3 components: the preprocessor, the core parser, and Tidy. Parsoid relies on the PHP preprocessor (accessed via the MediaWiki API), so that part of the PHP parser will continue to be in operation.
As noted in my update, we are working towards read views served by Parsoid HTML which requires several ducks to be lined up in a row. When that happens everywhere, the core PHP parser and Tidy will no longer be used.
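For illustration, that preprocessor dependency is just an API call; here is a rough sketch (parameter names and the response shape vary across MediaWiki versions, so treat the details as assumptions):

// Sketch: expand templates via the MediaWiki action API, which is the
// preprocessor functionality Parsoid depends on. The response shape varies
// by MediaWiki version, so the result is left untyped here.
async function expandTemplates(wikitext: string): Promise<unknown> {
  const params = new URLSearchParams({
    action: "expandtemplates",
    text: wikitext,
    title: "Sandbox",
    format: "json",
  });
  const res = await fetch(`https://en.wikipedia.org/w/api.php?${params}`);
  return res.json(); // contains the template-expanded wikitext
}

// Example: expandTemplates("{{convert|1|km|mi}}").then(console.log);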
However, I imagine your question is not so much about the PHP parser ... but more about wikitext and templating. Since I don't want to go off on a tangent here based on an assumption, maybe you can say more what you had in mind when you asked about "binning the PHP parser".
Subbu.
I didn't have anything in mind; evidently I was just vague on what the stuff in there is and does :-)
On Fri, Jun 26, 2015 at 11:52 AM, Subramanya Sastry ssastry@wikimedia.org wrote:
The "PHP parser" used in production has 3 components: the preprocessor, the core parser, and Tidy. Parsoid relies on the PHP preprocessor (accessed via the MediaWiki API), so that part of the PHP parser will continue to be in operation.
As noted in my update, we are working towards read views served by Parsoid HTML which requires several ducks to be lined up in a row. When that happens everywhere, the core PHP parser and Tidy will no longer be used.
Do we have plans for avoiding code rot in the "unused" PHP parser code that would affect smaller third-party sites that don't use Parsoid?
On 06/29/2015 09:20 AM, Brad Jorsch (Anomie) wrote:
Do we have plans for avoiding code rot in the "unused" PHP parser code that would affect smaller third-party sites that don't use Parsoid?
My response to your other email covers quite a bit of this.
As far as I have observed, the PHP parser code has been quite stable for a while. And, small third-party sites are unlikely to have complex requirements and are less likely to hit serious bugs. In any case, we'll make a good-faith effort to keep the PHP parser maintained and we'll fix critical and really high-priority bugs. But, simply by virtue of us being a small team with multiple responsibilities, we will prioritize reducing complexity in Parsoid over keeping the PHP parser maintained. In the long run, I think that is a better path to bringing the two systems together.
Subbu.
On Thu, Jun 25, 2015 at 3:22 PM, Subramanya Sastry ssastry@wikimedia.org wrote:
Hello everyone,
On behalf of the parsing team, here is an update about Parsoid, the bidirectional wikitext <-> HTML parser that supports Visual Editor, Flow, and Content Translation.
Subbu.
TL:DR;
- Parsoid[1] roundtrips 99.95% of the 158K pages in round-trip testing without introducing semantic diffs[2].
Congratulations, parsing team. This is very cool.
...and, pssst, wink wink, nudge nudge, etc: http://cacm.acm.org/about-communications/author-center/author-guidelines http://queue.acm.org/author_guidelines.cfm
:)
On 25 June 2015 at 15:22, Subramanya Sastry ssastry@wikimedia.org wrote:
TL:DR;
- Parsoid[1] roundtrips 99.95% of the 158K pages in round-trip testing without introducing semantic diffs[2].
- With trivial simulated edits, the HTML -> wikitext serializer used in production (selective serialization) introduces ZERO dirty diffs in 99.986% of those edits[3]. 10 of those 23 edits with dirty diffs are minor newline diffs.
Subbu,
You and your team have done, and keep on doing, amazing stuff. Thank you all so very much. "Congratulations" doesn't come close. :-)
Yours,
On Thu, Jun 25, 2015 at 6:22 PM, Subramanya Sastry ssastry@wikimedia.org wrote:
- Pare down rendering differences between the two systems so that we can start thinking about using Parsoid HTML instead of MWParser HTML for read views. ( https://phabricator.wikimedia.org/T55784 )
Any hope of adding the Parsoid metadata to the MWParser HTML so various fancy things can be done in core MediaWiki for smaller installations instead of having to run a separate service? Or does that fall under "Make Parsoid redundant in its current complex avatar"?
On 06/29/2015 09:19 AM, Brad Jorsch (Anomie) wrote:
Any hope of adding the Parsoid metadata to the MWParser HTML so various fancy things can be done in core MediaWiki for smaller installations instead of having to run a separate service? Or does that fall under "Make Parsoid redundant in its current complex avatar"?
Short answer: the latter. Long answer: read on.
Our immediate focus in the coming months will be to bring PHP parser and Parsoid output closer. Some of that work will be to tweak Parsoid output / CSS where required, but also to bring PHP parser output closer to Parsoid output. https://gerrit.wikimedia.org/r/#/c/196532/ is one step along those lines, for example. Scott has said he will review that closely with this goal in mind. Another step is to get rid of Tidy and use an HTML5-compliant tree builder similar to what Parsoid uses.
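To make the "fostering" behavior of such a tree builder concrete, here is a tiny sketch (not Parsoid's code; parse5 is used purely as a stand-in HTML5-compliant tree builder):

// Sketch only: parse5 stands in for an HTML5-compliant tree builder.
import { parseFragment, serialize } from "parse5";

// Wikitext with stray text between "{|" and the first row can produce
// markup of roughly this shape:
const input = "<table>oops<tr><td>cell</td></tr></table>";

// Per the HTML5 tree-construction algorithm, "oops" cannot live directly
// inside <table>, so it is "fostered" to just before the table.
console.log(serialize(parseFragment(input)));
// => oops<table><tbody><tr><td>cell</td></tr></tbody></table>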
Beyond these initial steps, bringing the two together (both in terms of output and functionality) will require bridging the computation models ... string-based vs. DOM-based. For example, we cannot really add Parsoid-style metadata for templates to the PHP parser output without being able to analyze the DOM -- that requires us to access the DOM after Tidy (or, ideally, the Tidy replacement) has had a go at it. It requires us to implement all the dirty tricks we use to identify template boundaries in the presence of unclosed tags, mis-nested tags, fostered content from tables, and the DOM restructuring the HTML tree builder does to comply with HTML5 semantics.
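For a flavor of what that DOM analysis looks like, here is a sketch (it assumes Parsoid-style markup where transcluded content carries typeof="mw:Transclusion", an "about" grouping id, and a data-mw attribute; jsdom is used only as a convenient DOM implementation, not what Parsoid itself uses):

// Sketch: find template boundaries in Parsoid-style HTML. Assumes the
// Parsoid DOM conventions (typeof="mw:Transclusion", about ids, data-mw);
// jsdom is just a convenient stand-in DOM here.
import { JSDOM } from "jsdom";

function listTransclusions(parsoidHtml: string) {
  const doc = new JSDOM(parsoidHtml).window.document;
  const found: Array<{ target?: string; about: string | null }> = [];

  for (const el of doc.querySelectorAll('[typeof~="mw:Transclusion"]')) {
    // data-mw carries the template call; siblings sharing the same "about"
    // id belong to the same transclusion even when it spans multiple nodes.
    const dataMw = JSON.parse(el.getAttribute("data-mw") ?? "{}");
    found.push({
      target: dataMw?.parts?.[0]?.template?.target?.wt,
      about: el.getAttribute("about"),
    });
  }
  return found;
}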
Besides that, if you also want to serialize this back to wikitext without introducing dirty diffs (there is really no reason to do all this extra work if you cannot also serialize it back to wikitext), you need to be able to either (a) maintain a lot of extra state in the DOM beyond what Parsoid maintains, or (b) do all the additional work that Parsoid does to maintain an extremely precise mapping between wikitext strings and DOM trees. Once again, the only reason (b) is complicated is unclosed tags, mis-nested tags, fostered content, and the DOM restructuring required by HTML5 semantics.
There is a fair amount of complexity hidden there in those 2 steps, and it really does not make sense to reimplement all of that in the PHP parser. If you do, at that point, you've effectively reimplemented Parsoid in PHP -- the PHP parser in its current form is unlikely to stay as is.
So, the only real way out here is to move the wikitext computational model closer to a DOM model. This is not a done deal really, but we have talked about several ideas over the last couple years to move this forward in increments. I don't want to go into a lot of detail in this email since this is already getting lengthy, but I am happy to talk more about it if there is interest.
To summarize, here are the steps as we see it:
* Bring PHP parser and Parsoid output as close as we can (replace Tidy, fix PHP parser output wherever possible to be closer to Parsoid output).
* Incrementally move the wikitext computational model to be DOM-based, using Parsoid as the bridge that preserves compatibility. This is easier if we have removed Tidy from the equation.
* Smooth out the harder edge cases, which simplifies the problem and eliminates the complexity.
* At this point, Parsoid's current complexity will be unnecessary (specifics dependent on previous steps) => you could have this functionality back in PHP if it is so desired. But by then, hopefully, there will also be better clarity about MediaWiki packaging that will influence this. Or, some small wikis might decide to be HTML-only wikis.
Subbu.