On Tue, Aug 11, 2015 at 5:16 PM, Trevor Parscal tparscal@wikimedia.org wrote:
Is it possible to use part of the Parsoid code to do this?
It is possible to do this in Parsoid (or any node service) with this line:
var sanerHTML = domino.createDocument(input).outerHTML;
However, performance is about 2x worse than current Tidy (116 ms for Tidy vs. 238 ms for domino on the Obama article), and about 4x slower than the fastest option in our tests. The task has many more benchmarks of the various options.
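For anyone who wants to reproduce this kind of comparison locally, here is a minimal timing sketch. It is not the harness used for the numbers on the task: `timeNormalizer` and `slowdown` are hypothetical helper names, and the median-of-N approach is just an assumption about a reasonable way to measure. Any function that maps an HTML string to an HTML string can be passed in.

```javascript
// Hypothetical helper: run `normalize` on `input` `runs` times and
// return the median wall-clock time in milliseconds.
function timeNormalizer(normalize, input, runs) {
  var samples = [];
  for (var i = 0; i < runs; i++) {
    var start = process.hrtime.bigint();
    normalize(input);
    var end = process.hrtime.bigint();
    samples.push(Number(end - start) / 1e6); // nanoseconds -> milliseconds
  }
  samples.sort(function (a, b) { return a - b; });
  return samples[Math.floor(samples.length / 2)];
}

// Hypothetical helper: how many times slower the candidate is than the baseline.
function slowdown(candidateMs, baselineMs) {
  return candidateMs / baselineMs;
}

// With the figures quoted in this thread (238 ms vs. 116 ms):
console.log(slowdown(238, 116).toFixed(2) + "x");

// To time the domino one-liner, assuming domino is installed:
// var domino = require('domino');
// var ms = timeNormalizer(function (html) {
//   return domino.createDocument(html).outerHTML;
// }, input, 20);
```

The median is used rather than the mean so that a single slow run (JIT warm-up, GC pause) does not dominate the result.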
Gabriel
- Trevor
On Tuesday, August 11, 2015, Tim Starling tstarling@wikimedia.org wrote:
I'm elevating this task of mine to RFC status:
https://phabricator.wikimedia.org/T89331
Running the output of the MediaWiki parser through HTML Tidy always seemed like a nasty hack. The effects on wikitext syntax are arbitrary and change from version to version. When we upgrade our Linux distribution, we sometimes see changes in the HTML generated by given wikitext, which is not ideal.
Parsoid took a different approach. After token-level transformations, tokens are fed into the HTML 5 parse algorithm, a complex but well-specified algorithm which generates a DOM tree from quirky input text.
http://www.w3.org/TR/html5/syntax.html
We can get nearly the same effect in MediaWiki by replacing the Tidy transformation stage with an HTML 5 parse followed by serialization of the DOM back to HTML. This would stabilize wikitext syntax and resolve several important syntax differences compared to Parsoid.
However:
- I have not been able to find any PHP implementation of this algorithm. Masterminds and Ressio do not even attempt it. Electrolinux attempts it but does not implement the error recovery parts that are of interest to us.
- Writing our own would be difficult.
- Even if we did write it, it would probably be too slow.
So the question is: what language should we use? Since this is the standard programmer troll question, please bring popcorn.
The best implementation of this algorithm is in Java: the validator.nu parser is maintained by Mozilla, and has source translation to C++, which is used by Mozilla and could potentially be used for an HHVM extension.
There is also a Rust port (also written by Mozilla), and notable implementations in JavaScript and Python.
For WMF, a Java service would be quite easily done, and I have prototyped it already. An HHVM extension might also be possible. A non-service fallback for small installations might be Node.js or a compiled binary from Rust or C++.
-- Tim Starling
Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l