Interesting. What is the cause of the slower speed?
- Trevor
On Tuesday, August 11, 2015, Gabriel Wicke <gwicke(a)wikimedia.org> wrote:
On Tue, Aug 11, 2015 at 5:16 PM, Trevor Parscal <tparscal(a)wikimedia.org> wrote:
Is it possible to use part of the Parsoid code to do this?
It is possible to do this in Parsoid (or any node service) with this line:
var sanerHTML = domino.createDocument(input).outerHTML;
However, performance is about 2x worse than the current Tidy (116ms for Tidy vs. 238ms for domino on the Obama article), and about 4x slower than the fastest option in our tests. The task has many more benchmarks of the various options.
Gabriel
- Trevor
On Tuesday, August 11, 2015, Tim Starling <tstarling(a)wikimedia.org> wrote:
I'm elevating this task of mine to RFC
status:
https://phabricator.wikimedia.org/T89331
Running the output of the MediaWiki parser through HTML Tidy always
seemed like a nasty hack. The effects on wikitext syntax are arbitrary
and change from version to version. When we upgrade our Linux
distribution, we sometimes see changes in the HTML generated by given
wikitext, which is not ideal.
Parsoid took a different approach. After token-level transformations,
tokens are fed into the HTML 5 parse algorithm, a complex but
well-specified algorithm which generates a DOM tree from quirky input
text.
http://www.w3.org/TR/html5/syntax.html
We can get nearly the same effect in MediaWiki by replacing the Tidy
transformation stage with an HTML 5 parse followed by serialization of
the DOM back to HTML. This would stabilize wikitext syntax and resolve
several important syntax differences compared to Parsoid.
However:
* I have not been able to find any PHP implementation of this
algorithm. Masterminds and Ressio do not even attempt it. Electrolinux
attempts it but does not implement the error recovery parts that are
of interest to us.
* Writing our own would be difficult.
* Even if we did write it, it would probably be too slow.
So the question is: what language should we use? Since this is the
standard programmer troll question, please bring popcorn.
The best implementation of this algorithm is in Java: the validator.nu
parser is maintained by Mozilla, and has source translation to C++,
which is used by Mozilla and could potentially be used for an HHVM
extension.
There is also a Rust port (also written by Mozilla), and notable
implementations in JavaScript and Python.
For WMF, a Java service would be quite easily done, and I have
prototyped it already. An HHVM extension might also be possible. A
non-service fallback for small installations might be Node.js or a
compiled binary from Rust or C++.
-- Tim Starling
_______________________________________________
Wikitech-l mailing list
Wikitech-l(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
--
Gabriel Wicke
Principal Engineer, Wikimedia Foundation