On 11/09/2015 12:37 PM, Petr Bena wrote:
Do you really want to say that reading from disk is faster than processing the text using the CPU? I don't know how complex the syntax of mw actually is, but if that's true, C++ compilers are probably much faster than Parsoid, and those are very slow.
What takes so much CPU time in turning wikitext into HTML? Sounds like JS wasn't the best choice here.
The problem is not turning wikitext into HTML; it is turning it into HTML in such a way that it can be turned back into wikitext when it is edited, without introducing dirty diffs.
That requires keeping state around, tracking the wikitext closely, and doing a lot more analysis.
That means detecting markup errors and retaining error-recovery information so that you can account for it during analysis, and also so that you can reintroduce those markup errors when you convert the HTML back to wikitext. This is why we proposed https://phabricator.wikimedia.org/T48705, since we already have all the information about broken wikitext usage.
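To make the "keeping state around" part concrete, here is a minimal TypeScript sketch. It is not Parsoid's actual code; the types and the serializeFromScratch() helper are made up for illustration. The idea it shows: each HTML node keeps track of the wikitext source range it came from, so unedited nodes can be written back verbatim and only edited nodes have to be re-serialized.

```typescript
// Minimal sketch (not Parsoid's real implementation): each node remembers the
// source range of the wikitext it came from, so unedited nodes round-trip
// byte for byte instead of being re-serialized (re-serialization could
// normalize markup, including recovered markup errors, and cause dirty diffs).

interface SourceRange {
  start: number; // offset of this node's wikitext in the original source
  end: number;
}

interface ParsedNode {
  html: string;        // rendered HTML for this node
  src?: SourceRange;   // where in the original wikitext this node came from
  modified: boolean;   // set when the editor actually changes the node
}

// Hypothetical fallback serializer: a real one has to reproduce the exact
// markup conventions (and the recovered markup errors) of the original page.
function serializeFromScratch(node: ParsedNode): string {
  return `[re-serialized from HTML: ${node.html}]`;
}

// Selective serialization: reuse the original wikitext wherever possible.
function toWikitext(nodes: ParsedNode[], originalWikitext: string): string {
  return nodes
    .map((node) =>
      !node.modified && node.src
        ? originalWikitext.slice(node.src.start, node.src.end)
        : serializeFromScratch(node)
    )
    .join("");
}
```

The real thing has to track far more than this (the error-recovery information mentioned above, among other things), but it shows why state has to survive the wikitext-to-HTML conversion rather than being thrown away after rendering.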
If you are interested in more details, either show up on #mediawiki-parsoid, or look at this April 2014 tech talk: A preliminary look at Parsoid internals [ Slides https://commons.wikimedia.org/wiki/File:Parsoid.techtalk.apr15.2014.pdf, Video https://www.youtube.com/watch?v=Eb5Ri0xqEzk ]. It has some details.
So, TL;DR: Parsoid is a *bi-directional* wikitext <-> HTML bridge, and doing that is non-trivial.
Subbu.