New subject: [Engineering] Parsoid announcement: Main roundtrip quality target achieved

25 Jun 2015


      Hello everyone,
On behalf of the parsing team, here is an update about Parsoid, the 
bidirectional wikitext <-> HTML parser that supports  Visual Editor, 
Flow, and Content Translation.
Subbu.
-----------------------------------------------------------------------
TL:DR;
1. Parsoid[1] roundtrips 99.95% of the 158K pages in round-trip testing
    without introducing semantic diffs[2].
2. With trivial simulated edits, the HTML -> wikitext serializer used
    in production (selective serialization) introduces ZERO dirty diffs
    in 99.986% of those edits[3]. 10 of those 23 edits with dirty diffs
    are minor newline diffs.
-----------------------------------------------------------------------
Couple days back (June 23rd), Parsoid achieved 99.95%[2] semantic accuracy
in the wikitext -> HTML -> wikitext roundtripping process on the set of
about 158K pages randomly picked from about 16 wikis back in 2013.
Keeping this test set constant has let us monitor our progress over time.
We were at 99.75% last year around this time.
What does this mean?
--------------------
* Despite the practical complexities of wikitext, the mismatch in the
   processing models of wikitext (string-based) and Parsoid (DOM-based),
   and the various wikitext "errors" that are found on pages, Parsoid is 
able
   to maintain a reversible mapping between wikitext constructs and their
   equivalent HTML DOM trees that HTML editors and other tools can 
manipulate.
The majority of differences in the 0.05% arise because of wikitext 
errors:
   links in links, 'fosterable'[4] content in tables, and some scenarios
   with unmatched quotes in attributes. Parsoid does not support
   round-tripping (RT) of these.
* While this is not a big change from how it has been for about a year now
   in terms of Parsoid's support for editing, this is a notable milestone
   for us in terms of the confidence we have in Parsoid's ability to handle
   the wikitext usage seen in production wikis and our ability to RT them
   accurately without corrupting pages. This should also boost confidence
   of all applications that rely on Parsoid.
* In production, Parsoid uses a selective serialization strategy which
   tries to preserve unedited parts of wikitext as far as possible.
As part of regular testing, we also simulate a trivial edit by adding
   a new comment to the page and run the edited HTML through this
   selective serializer. All but 23 pages (0.014% of trivial edits) had
   ZERO dirty diffs[3]. Of these 23, 10 of the diffs were minor newline 
diffs.
In production, the dirty diff rate will be higher than 0.014% because of
   more complex edits and because of bugs in any of 3 components involved
   in visual editing on Wikipedias (Parsoid, RESTBase[5] and Visual Editor)
   and their interaction. But, the base accuracy of Parsoid's roundtripping
   (both in terms of full and selective serialization) is critical to 
ensuring
   clean visual edits. The above milestones are part of ensuring that.
What does this not mean?
------------------------
* If you edit one of those 0.05% of pages in VE, the VE-Parsoid combination
   will break the page. NO!
If you edit the broken part of the page, Parsoid will very likely 
normalize
   the broken wikitext to the non-erroneous form (break up nested links,
   move fostered content out of the table, drop duplicate transclusion
   parameters, etc.) In the odd case, it could cause a dirty diff that 
changes
   the semantics of those broken constructs.
* Parsoid's visual rendering is 99.95% identical to PHP parser 
rendering. NO!
RT tests are focused on Parsoid's ability to support editing without
   introducing dirty diffs. Even though Parsoid might render a page
   differently than the default read view (and might even be incorrect),
   we are nevertheless able to RT it without breaking the wikitext.
On the way to getting to 99.95% RT accuracy, we have improved and fixed
   several bugs in Parsoid's rendering. The rendering is also fairly 
identical
   to the default read view (otherwise, VE editors will definitely 
complain).
   However, we haven't done sufficient testing to systematically identify
   rendering incompatibilities and quantify this. In the coming quarters,
   we are going to turn our attention to this problem. We have a visual
   diffing infrastructure to help us with this (we take screenshots of
   Parsoid's output and the default output and compare those images and find
   diffs). We'll have to tweak and fix our visual-diffing setup and then fix
   rendering problems we find.
* 100% roundtripping accuracy is within reach. NO!
The reality is that there are a lot of pages out there that have various
   kinds of broken markup (mis-nested html tags, unmatched html tags,
   broken templates) in production. There are probably other edge case
   scenarios that trigger different behavior in Parsoid and the PHP parser.
   Because we go to great lengths in Parsoid to avoid dirty diffs, our
   selective serialization works quite well. There have been very few 
reports
   of page corruption over the last year. And, where they have surfaced, 
we've
   usually moved pretty quickly to fix them, and we'll continue to do so.
In addition, our diff classification algo will never be perfect and there
   will always be false positives. Overall, we may crawl further along by
   0.01% or 0.02%, but we are not holding our breath and neither should you.
* If we pick a new corpus of 100K pages, we'll have similar accuracy. MAYBE!
Because we've tested against a random sample of pages across multiple
   Wikipedias, we expect that we've encountered the vast majority of 
scenarios
   that Parsoid will encounter in production. So, we have a very high degree
   of confidence that our fixes are not tailored to our test pages.
As part of https://phabricator.wikimedia.org/T101928 we will be doing
   a refresh of our test set, focusing more on enwp pages, non-Wikipedia 
test
   pages, and probably introducing a set of high traffic pages.
Next steps
----------
Given where we are now, we can now start thinking about the next level with
a bit more focus and energy. Our next steps are to bring the PHP parser and
Parsoid closer both in terms of output and long-term capabilities.
Some possibilities:
* Replace Tidy ( https://phabricator.wikimedia.org/T89331 )
* Pare down rendering differences between the two systems so that
   we can start thinking about using Parsoid HTML instead of MWParser HTML
   for read views. ( https://phabricator.wikimedia.org/T55784 )
* Use Parsoid as a WikiLint tool
   https://phabricator.wikimedia.org/T48705
https://www.mediawiki.org/wiki/Parsoid/Linting/GSoC_2014_Application
* Support improved templating abilities (data-driven tables, etc.)
* Improve Parsoid's parsing performance.
* Implement stable ids to be able to attach long-lived metadata to the
   DOM and track it across edits.
* Move wikitext to a DOM-based processing model, using Parsoid as a bridge.
   This could make several useful things possible, e.g. much better
   automatic edit conflict resolution.
* Long-term: Make Parsoid redundant in its current complex avatar.
References
----------
[1] https://www.mediawiki.org/wiki/Parsoid -- bidirectional parser 
supporting
     visual editing
[2] http://parsoid-tests.wikimedia.org/failsDistr
     http://parsoid-tests.wikimedia.org/topfails shows the actual failures
[3] 
http://parsoid-tests.wikimedia.org/rtselsererrors/aa5804ca89dc644f744af24c47...
[4] http://dev.w3.org/html5/spec-LC/tree-construction.html#foster-parenting
[5] https://www.mediawiki.org/wiki/RESTBase#Use_cases