Hello everyone,
On behalf of the parsing team, here is an update about Parsoid, the bidirectional wikitext <-> HTML parser that supports Visual Editor, Flow, and Content Translation.
Subbu.
-----------------------------------------------------------------------
TL;DR:

1. Parsoid[1] round-trips 99.95% of the 158K pages in round-trip testing without introducing semantic diffs[2].

2. With trivial simulated edits, the HTML -> wikitext serializer used in production (selective serialization) introduces ZERO dirty diffs in 99.986% of those edits[3]. 10 of the 23 edits that do produce dirty diffs are minor newline diffs.
-----------------------------------------------------------------------
A couple of days back (June 23rd), Parsoid achieved 99.95%[2] semantic accuracy in the wikitext -> HTML -> wikitext round-tripping process on a set of about 158K pages randomly picked from about 16 wikis back in 2013. Keeping this test set constant has let us monitor our progress over time; we were at 99.75% around this time last year.
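To make "round-trips without introducing semantic diffs" concrete, here is a minimal sketch of the per-page check in TypeScript. The parse/serialize callbacks and the whitespace-only classifier are illustrative assumptions; the actual round-trip test harness is rendering-aware and considerably more involved.

  // Sketch only: parse/serialize stand in for Parsoid's wt2html and html2wt
  // paths; they are not the real API.
  type Parse = (wikitext: string) => Promise<string>;     // wikitext -> HTML
  type Serialize = (html: string) => Promise<string>;     // HTML -> wikitext

  interface RTResult {
    identical: boolean;     // serialized output matches the input byte-for-byte
    semanticDiff: boolean;  // a difference that could change rendered meaning
  }

  // Deliberately crude classifier: whitespace/newline-only changes count as
  // syntactic; everything else counts as semantic (i.e. against the 99.95%).
  function isSemanticDiff(a: string, b: string): boolean {
    const normalize = (s: string) => s.replace(/\s+/g, ' ').trim();
    return normalize(a) !== normalize(b);
  }

  async function roundTrip(wt: string, parse: Parse, serialize: Serialize): Promise<RTResult> {
    const html = await parse(wt);        // wikitext -> HTML (with round-trip metadata)
    const rtWt = await serialize(html);  // HTML -> wikitext, with no edits applied
    if (rtWt === wt) {
      return { identical: true, semanticDiff: false };
    }
    return { identical: false, semanticDiff: isSemanticDiff(wt, rtWt) };
  }

A page counts as clean when either the serialized output is identical to the input or the remaining differences are purely syntactic.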
What does this mean?
--------------------

* Despite the practical complexities of wikitext, the mismatch in the processing models of wikitext (string-based) and Parsoid (DOM-based), and the various wikitext "errors" that are found on pages, Parsoid is able to maintain a reversible mapping between wikitext constructs and their equivalent HTML DOM trees that HTML editors and other tools can manipulate.
The majority of the differences in that remaining 0.05% arise from wikitext errors: links nested inside links, 'fosterable'[4] content in tables, and some scenarios with unmatched quotes in attributes. Parsoid does not support round-tripping (RT) of these.
* While this is not a big change from where we have been for about a year now in terms of Parsoid's support for editing, it is a notable milestone in terms of the confidence we have in Parsoid's ability to handle the wikitext usage seen on production wikis and to round-trip it accurately without corrupting pages. This should also boost the confidence of all applications that rely on Parsoid.
* In production, Parsoid uses a selective serialization strategy that tries to preserve unedited parts of the wikitext as far as possible (a rough sketch of the idea follows below).
As part of regular testing, we also simulate a trivial edit by adding a new comment to the page and run the edited HTML through this selective serializer. All but 23 pages had ZERO dirty diffs[3]; those 23 amount to just 0.014% of the trivial edits, and 10 of them were minor newline diffs.
In production, the dirty diff rate will be higher than 0.014% because of more complex edits and because of bugs in any of the three components involved in visual editing on Wikipedias (Parsoid, RESTBase[5], and Visual Editor) and their interaction. But the base accuracy of Parsoid's round-tripping (both full and selective serialization) is critical to ensuring clean visual edits, and the above milestones are part of ensuring that.
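For those curious what selective serialization looks like, here is a much-simplified sketch. The DOMNode shape, the edited flag, originalSource, and fullSerialize are placeholders for illustration, not Parsoid's actual data structures or API.

  // Simplified idea: unedited subtrees reuse the wikitext source recorded at
  // parse time; only edited content goes through full HTML -> wikitext
  // conversion. (The real serializer also emits a changed node's own markup
  // when recursing; that is omitted here for brevity.)
  interface DOMNode {
    edited: boolean;          // this node itself was changed by the editor
    originalSource?: string;  // wikitext source span recorded at parse time
    children: DOMNode[];
  }

  function subtreeEdited(n: DOMNode): boolean {
    return n.edited || n.children.some(subtreeEdited);
  }

  function selectiveSerialize(node: DOMNode, fullSerialize: (n: DOMNode) => string): string {
    // Nothing in this subtree changed: emit the recorded wikitext verbatim,
    // so unedited parts of the page cannot pick up dirty diffs.
    if (!subtreeEdited(node) && node.originalSource !== undefined) {
      return node.originalSource;
    }
    // The node itself changed, or we have no recorded source: serialize fully.
    if (node.edited || node.originalSource === undefined) {
      return fullSerialize(node);
    }
    // Only descendants changed: recurse so unedited siblings still reuse source.
    return node.children.map((c) => selectiveSerialize(c, fullSerialize)).join('');
  }

The simulated-edit test described above is essentially this path with a single new comment node marked as edited; everything else reuses its recorded source, which is why the dirty-diff rate for trivial edits is so low.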
What does this not mean?
------------------------

* If you edit one of those 0.05% of pages in VE, the VE-Parsoid combination will break the page. NO!
If you edit the broken part of the page, Parsoid will very likely normalize the broken wikitext to a non-erroneous form (break up nested links, move fostered content out of the table, drop duplicate transclusion parameters, etc.). In the odd case, it could cause a dirty diff that changes the semantics of those broken constructs.
* Parsoid's visual rendering is 99.95% identical to PHP parser rendering. NO!
RT tests are focused on Parsoid's ability to support editing without introducing dirty diffs. Even though Parsoid might render a page differently from the default read view (and might even be incorrect), we are nevertheless able to RT it without breaking the wikitext.
On the way to 99.95% RT accuracy, we have improved and fixed several bugs in Parsoid's rendering. The rendering is also largely identical to the default read view (otherwise, VE editors would definitely complain). However, we haven't done enough testing to systematically identify rendering incompatibilities and quantify this. In the coming quarters, we are going to turn our attention to this problem. We have a visual diffing infrastructure to help us here: we take screenshots of Parsoid's output and the default output, compare the images, and find diffs. We'll have to tweak and fix our visual-diffing setup and then fix the rendering problems we find.
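As a rough illustration of the visual-diffing idea (not our actual setup), here is a sketch of the pixel-comparison core, assuming the two screenshots have already been captured as same-sized RGBA buffers; the function name and tolerance are made up for the example.

  // Compare two same-sized RGBA screenshots, e.g. Parsoid's rendering vs. the
  // default read view, and report the fraction of pixels that differ. The real
  // setup also handles capture, cropping/alignment, and per-region reporting.
  function pixelDiffRatio(a: Uint8Array, b: Uint8Array, tolerance = 8): number {
    if (a.length !== b.length || a.length % 4 !== 0) {
      throw new Error('screenshots must be same-sized RGBA buffers');
    }
    let differing = 0;
    for (let i = 0; i < a.length; i += 4) {
      // Compare R, G, B channels; ignore alpha.
      const delta = Math.abs(a[i] - b[i]) + Math.abs(a[i + 1] - b[i + 1]) + Math.abs(a[i + 2] - b[i + 2]);
      if (delta > tolerance) {
        differing++;
      }
    }
    return differing / (a.length / 4);  // 0 = renderings match, 1 = every pixel differs
  }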
* 100% round-tripping accuracy is within reach. NO!
The reality is that there are a lot of pages in production with various kinds of broken markup (mis-nested HTML tags, unmatched HTML tags, broken templates). There are probably other edge-case scenarios that trigger different behavior in Parsoid and the PHP parser. Because we go to great lengths in Parsoid to avoid dirty diffs, our selective serialization works quite well: there have been very few reports of page corruption over the last year, and where they have surfaced, we've usually moved quickly to fix them, and we'll continue to do so.
In addition, our diff classification algorithm will never be perfect, and there will always be false positives. Overall, we may crawl further along by 0.01% or 0.02%, but we are not holding our breath and neither should you.
* If we pick a new corpus of 100K pages, we'll have similar accuracy. MAYBE!
Because we've tested against a random sample of pages across multiple Wikipedias, we expect that we've encountered the vast majority of scenarios that Parsoid will encounter in production. So, we have a very high degree of confidence that our fixes are not tailored to our test pages.
As part of https://phabricator.wikimedia.org/T101928, we will refresh our test set, focusing more on enwp pages and non-Wikipedia test pages, and probably introducing a set of high-traffic pages.
Next steps
----------

Given where we are now, we can start thinking about the next level with a bit more focus and energy. Our next steps are to bring the PHP parser and Parsoid closer, both in terms of output and in long-term capabilities.
Some possibilities:

* Replace Tidy ( https://phabricator.wikimedia.org/T89331 )
* Pare down rendering differences between the two systems so that we can start thinking about using Parsoid HTML instead of MWParser HTML for read views ( https://phabricator.wikimedia.org/T55784 )
* Use Parsoid as a WikiLint tool ( https://phabricator.wikimedia.org/T48705 , https://www.mediawiki.org/wiki/Parsoid/Linting/GSoC_2014_Application )
* Support improved templating abilities (data-driven tables, etc.)
* Improve Parsoid's parsing performance.
* Implement stable ids to be able to attach long-lived metadata to the DOM and track it across edits.
* Move wikitext to a DOM-based processing model, using Parsoid as a bridge. This could make several useful things possible, e.g. much better automatic edit conflict resolution.
* Long-term: make Parsoid redundant in its current complex avatar.
References
----------

[1] https://www.mediawiki.org/wiki/Parsoid -- bidirectional parser supporting visual editing
[2] http://parsoid-tests.wikimedia.org/failsDistr ; http://parsoid-tests.wikimedia.org/topfails shows the actual failures
[3] http://parsoid-tests.wikimedia.org/rtselsererrors/aa5804ca89dc644f744af24c47...
[4] http://dev.w3.org/html5/spec-LC/tree-construction.html#foster-parenting
[5] https://www.mediawiki.org/wiki/RESTBase#Use_cases