On behalf of the parsing team, here is an update about Parsoid, the
bidirectional wikitext <-> HTML parser that supports Visual Editor,
Flow, and Content Translation.
1. Parsoid roundtrips 99.95% of the 158K pages in round-trip testing
without introducing semantic diffs.
2. With trivial simulated edits, the HTML -> wikitext serializer used
in production (selective serialization) introduces ZERO dirty diffs
in 99.986% of those edits. Of the 23 edits that did produce dirty
diffs, 10 were minor newline diffs.
A couple of days back (June 23rd), Parsoid achieved 99.95% semantic accuracy
in the wikitext -> HTML -> wikitext roundtripping process on the set of
about 158K pages randomly picked from about 16 wikis back in 2013.
Keeping this test set constant has let us monitor our progress over time.
We were at 99.75% last year around this time.
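For readers curious what a single round-trip check amounts to, here is a
minimal Python sketch of the idea. Our actual test harness lives in
Parsoid's Node.js codebase; the endpoint paths, payload shapes, and the
newline normalization below are illustrative assumptions, not the real API.
  # Minimal sketch of one round-trip check (illustrative only; the real
  # test harness lives in Parsoid's Node.js codebase, and the endpoint
  # paths and payloads below are assumptions, not the actual API).
  import requests

  PARSOID = "http://localhost:8000"  # placeholder for a local Parsoid service

  def wt2html(wikitext, title):
      """wikitext -> Parsoid HTML (endpoint path is a placeholder)."""
      r = requests.post(PARSOID + "/transform/wikitext/to/html",
                        json={"wikitext": wikitext, "title": title})
      r.raise_for_status()
      return r.text

  def html2wt(html, title):
      """Parsoid HTML -> wikitext via full (non-selective) serialization."""
      r = requests.post(PARSOID + "/transform/html/to/wikitext",
                        json={"html": html, "title": title})
      r.raise_for_status()
      return r.text

  def is_newline_only_diff(a, b):
      """True if the texts match once blank lines and trailing whitespace
      are collapsed; such diffs are the 'minor newline diffs' above."""
      def norm(s):
          return "\n".join(line.rstrip() for line in s.splitlines() if line.strip())
      return norm(a) == norm(b)

  def roundtrip_check(wikitext, title):
      back = html2wt(wt2html(wikitext, title), title)
      if back == wikitext:
          return "clean"
      if is_newline_only_diff(back, wikitext):
          return "newline-only diff"
      return "diff (still needs classification as semantic vs. syntactic)"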
What does this mean?
* Despite the practical complexities of wikitext, the mismatch in the
processing models of wikitext (string-based) and Parsoid (DOM-based),
and the various wikitext "errors" found on pages, Parsoid is able
to maintain a reversible mapping between wikitext constructs and their
equivalent HTML DOM trees that HTML editors and other tools can work with.
The majority of differences in the remaining 0.05% arise from wikitext
errors: links nested inside other links, 'fosterable' content in tables
(content that sits inside table markup but outside any row or cell, which
HTML5 parsing moves out of the table), and some scenarios with unmatched
quotes in attributes. Parsoid does not support round-tripping (RT) of these.
* While this is not a big change from how it has been for about a year now
in terms of Parsoid's support for editing, this is a notable milestone
for us in terms of the confidence we have in Parsoid's ability to handle
the wikitext usage seen in production wikis and our ability to RT it
accurately without corrupting pages. This should also boost confidence
in Parsoid for all the applications that rely on it.
* In production, Parsoid uses a selective serialization strategy which
tries to preserve unedited parts of wikitext as far as possible.
As part of regular testing, we also simulate a trivial edit by adding
a new comment to the page and running the edited HTML through this
selective serializer (a sketch of this simulation appears at the end of
this section). All but 23 pages (0.014% of trivial edits) had ZERO dirty
diffs. Of these 23, 10 of the diffs were minor newline diffs.
In production, the dirty diff rate will be higher than 0.014% because of
more complex edits and because of bugs in any of the three components involved
in visual editing on Wikipedias (Parsoid, RESTBase and Visual Editor)
and their interaction. But, the base accuracy of Parsoid's roundtripping
(both in terms of full and selective serialization) is critical to
clean visual edits. The above milestones are part of ensuring that.
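To make the trivial-edit simulation mentioned above concrete, here is a
rough Python sketch of the idea: insert a new HTML comment into the Parsoid
HTML, serialize the edited HTML back to wikitext while passing along the
original HTML and wikitext so unedited text can be reused verbatim, and
check whether anything other than the inserted comment changed. The
endpoint path and the shape of the 'original' payload are assumptions for
illustration, not the actual API or test code.
  # Rough sketch of the trivial-edit / selective-serialization check
  # (illustrative; the endpoint and the shape of the 'original' payload
  # are assumptions, not the actual API).
  import requests

  PARSOID = "http://localhost:8000"  # placeholder for a local Parsoid service
  COMMENT = "<!-- rt-test comment -->"

  def simulate_trivial_edit(html):
      """Pretend a user made a trivial edit: append a comment to the body."""
      return html.replace("</body>", COMMENT + "</body>", 1)

  def selective_serialize(edited_html, orig_html, orig_wikitext, title):
      """HTML -> wikitext, supplying the original document so that unedited
      parts of the wikitext can be reused verbatim."""
      r = requests.post(PARSOID + "/transform/html/to/wikitext",
                        json={"html": edited_html,
                              "title": title,
                              "original": {"html": orig_html,
                                           "wikitext": orig_wikitext}})
      r.raise_for_status()
      return r.text

  def has_dirty_diff(orig_wikitext, new_wikitext):
      """Any change beyond the comment we deliberately inserted counts as a
      dirty diff; newline-only changes are classified separately."""
      return new_wikitext.replace(COMMENT, "", 1) != orig_wikitext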
What does this not mean?
* If you edit one of those 0.05% of pages in VE, the VE-Parsoid combination
will break the page. NO!
If you edit the broken part of the page, Parsoid will very likely normalize
the broken wikitext to the non-erroneous form (break up nested links,
move fostered content out of the table, drop duplicate transclusion
parameters, etc.). In the odd case, it could cause a dirty diff that changes
the semantics of those broken constructs.
* Parsoid's visual rendering is 99.95% identical to the PHP parser's. NO!
RT tests are focused on Parsoid's ability to support editing without
introducing dirty diffs. Even though Parsoid might render a page
differently than the default read view (and might even be incorrect),
we are nevertheless able to RT it without breaking the wikitext.
On the way to getting to 99.95% RT accuracy, we have improved Parsoid's
rendering and fixed several bugs in it. The rendering is also fairly close
to the default read view (otherwise, VE editors would definitely have
noticed).
However, we haven't done sufficient testing to systematically identify
rendering incompatibilities and quantify this. In the coming quarters,
we are going to turn our attention to this problem. We have a visual
diffing infrastructure to help us with this (we take screenshots of
Parsoid's output and the default output and compare those images to find
diffs; a small sketch of that comparison step appears further below).
We'll have to tweak and fix our visual-diffing setup and then fix the
rendering problems we find.
* 100% roundtripping accuracy is within reach. NO!
The reality is that there are a lot of pages out there that have various
kinds of broken markup (mis-nested html tags, unmatched html tags,
broken templates) in production. There are probably other edge case
scenarios that trigger different behavior in Parsoid and the PHP parser.
Because we go to great lengths in Parsoid to avoid dirty diffs, our
selective serialization works quite well. There have been very few reports
of page corruption over the last year. And, where they have surfaced, we have
usually moved pretty quickly to fix them, and we'll continue to do so.
In addition, our diff classification algorithm will never be perfect and there
will always be false positives. Overall, we may crawl further along by
0.01% or 0.02%, but we are not holding our breath and neither should you.
* If we pick a new corpus of 100K pages, we'll have similar accuracy. MAYBE!
Because we've tested against a random sample of pages across multiple
Wikipedias, we expect that we've already seen the vast majority of wikitext
usage patterns that Parsoid will encounter in production. So, we have a very
high degree
of confidence that our fixes are not tailored to our test pages.
As part of https://phabricator.wikimedia.org/T101928 we will be doing
a refresh of our test set, focusing more on enwp pages, non-Wikipedia
pages, and probably introducing a set of high-traffic pages.
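As a pointer to what the visual diffing mentioned in the rendering bullet
above boils down to, here is a minimal sketch of the image-comparison step,
assuming the two screenshots (Parsoid rendering and default read view) have
already been captured at the same viewport size. It uses Pillow and
simplifies our actual setup considerably.
  # Minimal sketch of comparing two rendering screenshots with Pillow
  # (our actual visual-diff pipeline differs; this only shows the idea).
  from PIL import Image, ImageChops

  def rendering_diff(parsoid_png, php_png, threshold=0):
      """Return a bounding box of differing pixels, or None if the two
      screenshots are (near-)identical."""
      a = Image.open(parsoid_png).convert("RGB")
      b = Image.open(php_png).convert("RGB")
      if a.size != b.size:
          # Different page dimensions already indicate a layout difference.
          return (0, 0) + a.size
      diff = ImageChops.difference(a, b)
      if threshold:
          # Ignore small per-channel differences (anti-aliasing, fonts, etc.)
          diff = diff.point(lambda px: 0 if px <= threshold else px)
      return diff.getbbox()  # None means no differing pixels

  # Hypothetical usage:
  # box = rendering_diff("Foo.parsoid.png", "Foo.php.png", threshold=8)
  # if box:
  #     print("rendering differs in region", box)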
Given where we are, we can now start thinking about the next level with
a bit more focus and energy. Our next steps are to bring the PHP parser and
Parsoid closer both in terms of output and long-term capabilities.
* Replace Tidy ( https://phabricator.wikimedia.org/T89331 )
* Pare down rendering differences between the two systems so that
we can start thinking about using Parsoid HTML instead of MWParser HTML
for read views. ( https://phabricator.wikimedia.org/T55784 )
* Use Parsoid as a WikiLint tool
* Support improved templating abilities (data-driven tables, etc.)
* Improve Parsoid's parsing performance.
* Implement stable ids to be able to attach long-lived metadata to the
DOM and track it across edits.
* Move wikitext to a DOM-based processing model, using Parsoid as a bridge.
This could make several useful things possible, e.g. much better
automatic edit conflict resolution.
* Long-term: Make Parsoid redundant in its current complex avatar.