On 01/30/2013 12:36 AM, Ariel T. Glenn wrote:
> On 23-01-2013 (Wed), at 15:10 -0800, Gabriel Wicke wrote:
>> Fellow MediaWiki hackers!
>> After the pretty successful December release and some more clean-up work
>> following up on that, we are now considering the next steps for Parsoid.
>> To this end, we have put together a rough roadmap for the Parsoid project at
>> https://www.mediawiki.org/wiki/Parsoid/Roadmap
>
> One thing that jumped out at me is this:
>
> "We have also decided to narrow our focus a bit by continuing to use the
> PHP preprocessor to perform our template expansion."
>
> While I understand the reasoning and even sympathize with it, I had
> hoped that Parsoid, when complete, would facilitate the implementation
> of wikitext parsers apart from the canonical parser (i.e. MediaWiki),
> with clearly defined behavior for the language including templates. Is
> that idea dead then?
As it exists, Parsoid can tackle full template expansion -- but, since
it does not support all parser functions natively, this is still
incomplete. We bypass that gap for the most part by relying on
the PHP preprocessor to give us fully expanded wikitext, which we then
process further.
We are refocusing our efforts towards exploring HTML-based templating --
while supporting existing templates. Lua-based templates already clean
up a lot of template logic by having access to full conditional logic.
By relying more on DOM-based templates (which would also be editable in
a visual-editor-like client), the expectation is that direct wikitext
use itself will progressively diminish. Since most wikitext and
probably Lua templates already return well-formed DOM (though not all do),
simply adding a parse layer on top of them lets them be supported in a
DOM-only templating framework. So, the first outcome of this effort
would be to require templates to always return DOM fragments.
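To illustrate the "templates must return DOM fragments" requirement, here is a rough sketch of the kind of balance check such a framework could run on a template's expanded output. This is my own toy example, not Parsoid code; the function name `returns_dom_fragment` and the sample inputs are made up for illustration.

```python
from html.parser import HTMLParser

VOID_TAGS = {"br", "hr", "img", "meta", "link", "input"}

class BalanceChecker(HTMLParser):
    """Track open tags and flag any mismatched or stray close tag."""
    def __init__(self):
        super().__init__()
        self.stack = []
        self.balanced = True

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:      # void tags never need a close
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()
        else:
            self.balanced = False     # close tag with no matching open

def returns_dom_fragment(expanded_html):
    """True iff the expanded template output is a balanced DOM fragment."""
    checker = BalanceChecker()
    checker.feed(expanded_html)
    checker.close()
    return checker.balanced and not checker.stack

print(returns_dom_fragment("<table><tr><td>cell</td></tr></table>"))  # True
print(returns_dom_fragment("<table><tr><td>start of a table"))        # False
```

A template that only emits the start of a table (a common wikitext idiom today) fails this check, which is exactly the class of template the mail says would need a parse layer on top.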
In such a diminished-use scenario, we do not see the need to focus a
lot of energy and effort on attaining full compatibility entirely in
Parsoid. We see Parsoid+PHP parser as providing legacy wikitext support
while a large chunk of editing and storage happens in the HTML world.
We can then take it from there based on how far this strategy takes us.
If there still remains a need for a full replacement wikitext evaluation
system (because of the continuing popularity of wikitext, for
performance reasons, or whatever else), that option remains open.
Even so, there is still the possibility of identifying "erroneous" or
"undefined behavior" wikitext markup within Parsoid (in quotes, because
anything that is thrown at the PHP parser and Parsoid must always be
rendered). We can detect, for example, missing opening/closing
HTML tags (since we currently have to do that to round-trip them
properly without introducing dirty diffs), detect unbalanced tags in
certain contexts by treating them as balanced-DOM contexts (image
captions, extensions), and other such scenarios. We have also been
adding a number of parser tests that try to specify edge-case behaviors
and make a call as to whether each is legitimate or undefined
behavior. All of this could be used to issue warnings in a lint-like
mode, which could then serve as a de-facto definition of legitimate
wikitext, since a grammar-based definition of wikitext is not possible.
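As a toy illustration of such a lint-like mode (again my own sketch, nothing to do with Parsoid's actual implementation), a pass over wikitext can flag open tags that are never closed, with line numbers, much like the unclosed-tag detection described above:

```python
import re

# Match an HTML-ish tag: optional leading "/", tag name, optional
# trailing "/" for self-closing tags like <br/>.
TAG_RE = re.compile(r"<(/?)(\w+)[^>]*?(/?)>")

def lint_unclosed_tags(wikitext):
    """Return human-readable warnings for unbalanced tags in wikitext."""
    warnings = []
    open_tags = []  # stack of (tag, line_number)
    for lineno, line in enumerate(wikitext.splitlines(), 1):
        for m in TAG_RE.finditer(line):
            closing, tag, selfclosing = m.group(1), m.group(2).lower(), m.group(3)
            if selfclosing:
                continue
            if closing:
                if open_tags and open_tags[-1][0] == tag:
                    open_tags.pop()
                else:
                    warnings.append(f"line {lineno}: stray closing </{tag}>")
            else:
                open_tags.append((tag, lineno))
    for tag, lineno in open_tags:
        warnings.append(f"line {lineno}: <{tag}> never closed")
    return warnings

for w in lint_unclosed_tags("some ''text''\n<div>an open div\nmore text"):
    print(w)  # line 2: <div> never closed
```

A real linter would have to special-case the contexts the mail mentions (image captions, extension tags) where unbalanced markup is treated as a balanced-DOM boundary; this sketch only shows the flat case.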
So, while we are not focusing on attaining full replacement capability
in Parsoid, our new directions do not entirely do away with the idea
that you alluded to: (1) we are attempting to move towards templates
(DOM/Lua/wikitext) that can only return DOM fragments; (2) we retain the
ability to provide some kind of linting in Parsoid (but this
functionality is not at the top of our todo list, since we are focused on
reducing the scope of wikitext use over the long term, while providing
full compatibility in the immediate and short term).
Does that answer your question?
Subbu.
PS: The other primary reason for going with a new wikitext
evaluator/runtime (a more accurate term than "parser"), possibly
in C++, was performance -- but we are already approaching that in a
different way, based on the observation that most edits to wiki pages
are "minor" (relative to the size of the page). If so, there is no
sense in fully serializing and fully reparsing a page on every such minor
edit -- it is a waste of server resources. Since we now have a fully
RT-able (round-trippable) HTML representation of wikitext, selective
serialization (HTML->wikitext), selective reparsing (of wikitext-based
edits that happen outside the VE), and caching of DOM fragments
(transclusions, etc.) should take care of the performance issue -- these
are addressed in the RFC.
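The selective-reparse idea can be sketched in a few lines. This is purely illustrative and not how Parsoid is implemented: `parse_section` stands in for a real wikitext parser, sections are split naively on "== Heading ==" lines, and the cache is keyed by a hash of each section's source, so a minor edit only re-parses the section it touched:

```python
import hashlib

def parse_section(src, calls):
    calls.append(src)              # record that a full parse happened
    return f"<section>{src}</section>"

def split_sections(wikitext):
    # Naive split on top-level "== Heading ==" lines; purely illustrative.
    sections, current = [], []
    for line in wikitext.splitlines():
        if line.startswith("== ") and current:
            sections.append("\n".join(current))
            current = []
        current.append(line)
    sections.append("\n".join(current))
    return sections

def render(wikitext, cache, calls):
    html = []
    for src in split_sections(wikitext):
        key = hashlib.sha1(src.encode()).hexdigest()
        if key not in cache:       # cache hit == no reparse needed
            cache[key] = parse_section(src, calls)
        html.append(cache[key])
    return "".join(html)

cache, calls = {}, []
v1 = "== A ==\nold text\n== B ==\nstable text"
v2 = "== A ==\nnew text\n== B ==\nstable text"
render(v1, cache, calls)           # first render parses both sections
render(v2, cache, calls)           # "minor" edit: only section A reparsed
print(len(calls))                  # 3 parses total, not 4
```

The same caching idea extends to DOM fragments for transclusions: an unchanged template invocation keeps its cached fragment across edits to the surrounding page.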