Interesting. What is the cause of the slower speed?
- Trevor
On Tuesday, August 11, 2015, Gabriel Wicke <gwicke(a)wikimedia.org> wrote:
On Tue, Aug 11, 2015 at 5:16 PM, Trevor Parscal <tparscal(a)wikimedia.org> wrote:
Is it possible to use part of the Parsoid code to do this?
It is possible to do this in Parsoid (or any node service) with this line:
var sanerHTML = domino.createDocument(input).outerHTML;
However, performance is about 2x worse than the current Tidy (116ms for Tidy vs. 238ms for domino on the Obama article), and about 4x slower than the fastest option in our tests. The task has many more benchmarks of the various options.
Gabriel
- Trevor
On Tuesday, August 11, 2015, Tim Starling <tstarling(a)wikimedia.org> wrote:
I'm elevating this task of mine to RFC
status:
https://phabricator.wikimedia.org/T89331
Running the output of the MediaWiki parser through HTML Tidy always
seemed like a nasty hack. The effects on wikitext syntax are arbitrary
and change from version to version. When we upgrade our Linux
distribution, we sometimes see changes in the HTML generated by given
wikitext, which is not ideal.
Parsoid took a different approach. After token-level transformations,
tokens are fed into the HTML 5 parse algorithm, a complex but
well-specified algorithm which generates a DOM tree from quirky input
text.
http://www.w3.org/TR/html5/syntax.html
We can get nearly the same effect in MediaWiki by replacing the Tidy
transformation stage with an HTML 5 parse followed by serialization of
the DOM back to HTML. This would stabilize wikitext syntax and resolve
several important syntax differences compared to Parsoid.
However:
* I have not been able to find any PHP implementation of this
algorithm. Masterminds and Ressio do not even attempt it. Electrolinux
attempts it but does not implement the error recovery parts that are
of interest to us.
* Writing our own would be difficult.
* Even if we did write it, it would probably be too slow.
So the question is: what language should we use? Since this is the
standard programmer troll question, please bring popcorn.
The best implementation of this algorithm is in Java: the validator.nu
parser is maintained by Mozilla, and has source translation to C++,
which is used by Mozilla and could potentially be used for an HHVM
extension.
There is also a Rust port (also written by Mozilla), and notable
implementations in JavaScript and Python.
For WMF, a Java service would be quite easily done, and I have
prototyped it already. An HHVM extension might also be possible. A
non-service fallback for small installations might be Node.js or a
compiled binary from Rust or C++.
-- Tim Starling
_______________________________________________
Wikitech-l mailing list
Wikitech-l(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
--
Gabriel Wicke
Principal Engineer, Wikimedia Foundation