On 01/30/2013 12:36 AM, Ariel T. Glenn wrote:
On 23-01-2013, Wed, at 15:10 -0800, Gabriel Wicke wrote:
Fellow MediaWiki hackers!
After the pretty successful December release and some more clean-up work following up on that, we are now considering the next steps for Parsoid. To this end, we have put together a rough roadmap for the Parsoid project at
One thing that jumped out at me is this:
"We have also decided to narrow our focus a bit by continuing to use the PHP preprocessor to perform our template expansion."
While I understand the reasoning and even sympathize with it, I had hoped that Parsoid, when complete, would facilitate the implementation of wikitext parsers apart from the canonical parser (i.e. MediaWiki), with clearly defined behavior for the language including templates. Is that idea dead then?
As it exists, Parsoid can tackle full template expansion -- but, since it does not support all parser functions natively, this is still incomplete. We can largely sidestep that gap by relying on the PHP preprocessor to give us fully expanded wikitext, which we then process further.
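To make that concrete, here is a rough sketch of what "relying on the PHP preprocessor" amounts to: the MediaWiki API's expandtemplates action does the template and parser-function expansion in PHP, and Parsoid only sees the already-expanded wikitext. The endpoint, the parameter handling, and the parsoidParse() placeholder below are illustrative, not our actual code:

    // Sketch: let the PHP preprocessor (via the MediaWiki API) expand all
    // templates/parser functions, then hand the expanded wikitext to a
    // Parsoid-style tokenizer/DOM builder. Names here are illustrative.

    const API = 'https://en.wikipedia.org/w/api.php';

    async function expandWithPhpPreprocessor(wikitext: string, title: string): Promise<string> {
      const params = new URLSearchParams({
        action: 'expandtemplates',
        format: 'json',
        title,                 // page context for {{PAGENAME}} and friends
        text: wikitext,
        prop: 'wikitext',      // older MediaWiki versions return the text under "*" instead
      });
      const res = await fetch(`${API}?${params}`);
      const body = await res.json();
      return body.expandtemplates.wikitext;
    }

    // Expansion happens in PHP; everything after that stays in Parsoid.
    async function parseWithLegacyExpansion(wikitext: string, title: string) {
      const expanded = await expandWithPhpPreprocessor(wikitext, title);
      return parsoidParse(expanded);   // placeholder for Parsoid's own pipeline
    }

    declare function parsoidParse(expandedWikitext: string): Document;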
We are refocusing our efforts towards exploring HTML-based templating -- while supporting existing templates. Lua-based templates already clean up a lot of template logic by having access to full conditional logic. By relying more on DOM-based templates (which would also be editable in a VisualEditor-like client), the expectation is that direct wikitext use itself will progressively diminish. Since most wikitext templates, and probably most Lua templates, already return well-formed DOM (not all do), simply adding a parse layer on top of them lets them be supported in a DOM-only templating framework. So, the first outcome of this effort would be to require templates to always return DOM fragments.
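A rough sketch of that "parse layer" idea follows; expandTemplate, hasUnclosedTags and wikitextToDom are made-up stand-ins, not real Parsoid APIs:

    // Hypothetical sketch of the "templates always return DOM fragments" rule:
    // whatever a template produces (wikitext, Lua output, ...) is parsed in
    // isolation into a fragment, instead of being spliced into the caller's
    // token stream. Helper names are illustrative only.

    function templateToDomFragment(name: string, args: Map<string, string>): DocumentFragment {
      const output = expandTemplate(name, args);   // wikitext or Lua-generated text

      // Under this model a template whose output only makes sense mid-table or
      // mid-<div> cannot leak structure into the surrounding page; flag it
      // instead of silently "fixing" it.
      if (hasUnclosedTags(output)) {
        throw new Error(`{{${name}}} does not return a well-formed DOM fragment`);
      }

      // The added parse layer: the output becomes a self-contained fragment.
      return wikitextToDom(output);
    }

    declare function expandTemplate(name: string, args: Map<string, string>): string;
    declare function hasUnclosedTags(wikitext: string): boolean;
    declare function wikitextToDom(wikitext: string): DocumentFragment;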
In such a diminished-use scenario, we do not see the need to focus a lot of energy and effort on attaining full compatibility entirely within Parsoid. We see Parsoid plus the PHP parser as providing legacy wikitext support while a large chunk of editing and storage happens in the HTML world. We can then take it from there based on how far this strategy takes us. If a full replacement wikitext evaluation system still turns out to be needed (because of the continuing popularity of wikitext, for performance reasons, or whatever else), that option remains open and is not closed off at this time.
Even so, there is still the possibility of identifying "erroneous" or "undefined behavior" wikitext markup within Parsoid (in quotes, because anything thrown at the PHP parser and Parsoid always needs to be rendered). We can detect, for example, missing opening/closing HTML tags (we currently have to do that anyway to round-trip them properly without introducing dirty diffs), and we can detect unbalanced tags in certain contexts by treating them as balanced-DOM contexts (image captions, extensions), among other such scenarios. We have also been adding a number of parser tests that try to specify edge-case behaviors and make a call as to whether each is legitimate or undefined behavior. All of this could be used to issue warnings in a lint-like mode, which could then serve as a de facto definition of legitimate wikitext, since a grammar-based definition of wikitext is not possible.
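Purely as an illustration (the token shape and function names below are made up, not Parsoid internals), a lint pass over one of those balanced-DOM contexts could be as simple as:

    // Minimal sketch of a lint-like pass, assuming a tokenizer that yields
    // HTML-ish tag tokens for a given context (image caption, extension body).
    // It flags the kind of markup mentioned above: stray closers and tags that
    // are opened but never closed inside that context.

    interface TagToken { kind: 'open' | 'close'; name: string; offset: number; }

    function lintBalancedContext(tokens: TagToken[], context: string): string[] {
      const warnings: string[] = [];
      const stack: TagToken[] = [];
      for (const t of tokens) {
        if (t.kind === 'open') {
          stack.push(t);
        } else if (stack.length && stack[stack.length - 1].name === t.name) {
          stack.pop();
        } else {
          warnings.push(`${context}: stray </${t.name}> at offset ${t.offset}`);
        }
      }
      for (const t of stack) {
        warnings.push(`${context}: <${t.name}> at offset ${t.offset} is never closed in this context`);
      }
      return warnings;
    }

    // e.g. lintBalancedContext(tokenizeTags(captionWikitext), 'image caption')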
So, while we are not focusing on attaining full replacement capability in Parsoid, our new directions do not entirely do away with the idea you alluded to: (1) we are attempting to move towards templates (DOM/Lua/wikitext) that can only return DOM fragments, and (2) we retain the ability to provide some kind of linting capability in Parsoid (but this functionality is not at the top of our to-do list, since we are focused on reducing the scope of wikitext use over the long term while providing full compatibility in the immediate and short term).
Does that answer your question?
Subbu.
PS: The other primary reason for going with a new wikitext evaluator/runtime (a more accurate term than "parser"), possibly in C++, was performance -- but we are already going at it in a different way, based on the notion that most edits on wiki pages are going to be "minor" edits (relative to the size of the page). If so, there is no sense in fully serializing and fully reparsing the page on every such minor edit -- it is a waste of server resources. Since we now have a fully RT-able HTML representation of wikitext, selective serialization (HTML->wikitext), selective reparsing (of wikitext-based edits that happen outside the VE), and caching of DOM fragments (transclusions, etc.) should take care of the performance issue -- these are addressed in the RFC.
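For the curious, the core of the selective-serialization idea is roughly the following: nodes untouched by the edit keep their original wikitext (looked up via recorded source offsets), and only modified subtrees go through the full HTML->wikitext serializer. This is a simplified sketch; wasModified, rangeOf, fullSerialize and wrapInOwnMarkup are placeholders for per-node information Parsoid tracks, not its actual API.

    interface SourceRange { start: number; end: number; }   // offsets into the original wikitext

    function selectiveSerialize(node: Node, origWikitext: string,
                                wasModified: (n: Node) => boolean,
                                rangeOf: (n: Node) => SourceRange | null): string {
      const range = rangeOf(node);
      if (range && !wasModified(node)) {
        // Untouched subtree: emit the original source verbatim -- no dirty
        // diffs, no serializer work.
        return origWikitext.slice(range.start, range.end);
      }
      if (node.nodeType !== Node.ELEMENT_NODE) {
        return fullSerialize(node);
      }
      // Modified subtree: recurse, so unmodified children can still reuse
      // their original source text.
      let out = '';
      for (const child of Array.from(node.childNodes)) {
        out += selectiveSerialize(child, origWikitext, wasModified, rangeOf);
      }
      return wrapInOwnMarkup(node, out);   // serialize only this node's own markup
    }

    declare function fullSerialize(n: Node): string;
    declare function wrapInOwnMarkup(el: Node, childWikitext: string): string;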