On 07/31/2015 12:55 PM, Ricordisamoa wrote:
Hi Subbu,
thank you for this thoughtful insight.
And thank you for starting this thread. :-)
HTML is not a barrier by itself. The problem seems to be Parsoid being
built primarily with VisualEditor in mind.
While we want the DOM to be VE-friendly, we definitely don't want the
DOM to be VE-centric and that has been the intention from the very
beginning. Flow and CX also use the Parsoid DOM for their functionality.
There are other users too [1]. We definitely want Parsoid's output to be
useful and usable more broadly as the canonical output representation of
wikitext and are open to fixing whatever prevents that.
As Scott noted in the other email on the thread, inspired (and maybe
challenged :-) ) by mwparserfromhell's utilities, he has already
whipped up a layer that provides an easier interface for manipulating
the DOM.
It is not clear to me how can a single DOM serving both view and edit
modes avoid redundancy.
You are right that there are some redundancies in information
representation (because of having to serve multiple needs), but as far
as I know, it is mostly around image attributes. If there is anything
else specific (beyond image attributes) that is bothering you, can you
flag that?
I see huge demand for alternative wikignome-style editors. The more
Parsoid's DOM is predictable, concise and documented, the more users
you get.
I think Parsoid's DOM is predictable :-) but can you say more about
what prompted you to say that? As for documentation, we document the DOM
we generate and its semantics here [2]. As for size, I just looked at
the Barack Obama page and here are some size numbers.
1540407 /tmp/Barack_Obama.parsoid.html
1197318 /tmp/Barack_Obama.parsoid.no-data-mw.html
1045161 /tmp/Barack_Obama.php-parser.output.footer-stripped.html
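To put those three numbers in proportion, here is a small arithmetic sketch. It assumes the figures are byte counts (the listing does not say so explicitly), and the variable names are illustrative only.

```python
# Sizes copied from the listing above; assumed to be byte counts.
parsoid_full = 1540407        # Parsoid HTML with inline data-mw
parsoid_no_data_mw = 1197318  # Parsoid HTML with data-mw removed
php_parser = 1045161          # PHP parser output, footer stripped

# How much of the Parsoid output is data-mw alone?
data_mw_bytes = parsoid_full - parsoid_no_data_mw
data_mw_share = data_mw_bytes / parsoid_full

# Size of each Parsoid variant relative to the PHP parser output.
ratio_full = parsoid_full / php_parser
ratio_no_data_mw = parsoid_no_data_mw / php_parser

print(f"data-mw: {data_mw_bytes} bytes ({data_mw_share:.1%} of Parsoid output)")
print(f"Parsoid vs PHP parser:               {ratio_full:.2f}x")
print(f"Parsoid without data-mw vs PHP parser: {ratio_no_data_mw:.2f}x")
```

On these figures, data-mw is a bit over a fifth of the Parsoid output, and removing it brings the ratio to the PHP parser output from roughly 1.47x down to roughly 1.15x.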
Right now, because we inline template and other editable information (as
inline JSON attributes of the DOM), it is a bit bulky. However, we have
always planned to move the data-mw attribute into its own bucket, which
we might do at some point, in which case the size will be closer to the
current PHP parser output. If we also moved page properties and other
metadata out, it would shrink a little bit more.
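As a rough illustration of what "its own bucket" could look like, the sketch below pulls data-mw off each element and stores it in a side table keyed by element id. This is not how Parsoid does or will do it: the function name, the bucket layout, and the sample fragment are all made up for the example, and it assumes the fragment is well-formed enough for an XML parser, which real page HTML may not be.

```python
import json
import xml.etree.ElementTree as ET


def split_data_mw(root):
    """Move data-mw attributes out of the tree into a dict keyed by element id.

    Hypothetical helper: elements without an id keep their data-mw inline,
    since there would be no key to find them by later.
    """
    bucket = {}
    for el in root.iter():
        if "data-mw" in el.attrib and el.get("id"):
            bucket[el.get("id")] = json.loads(el.attrib.pop("data-mw"))
    return bucket


# Made-up fragment in the general shape of Parsoid output.
sample = (
    '<p id="mwAQ">Hi <span id="mwAg" typeof="mw:Transclusion" about="#mwt1" '
    'data-mw=\'{"parts":[{"template":{"target":{"wt":"echo"}}}]}\'>x</span></p>'
)

root = ET.fromstring(sample)
bucket = split_data_mw(root)
slim = ET.tostring(root, encoding="unicode")

print(len(sample), "->", len(slim), "characters of HTML")
print("bucket keys:", sorted(bucket))
```

A client that only wants to render would fetch the slimmed HTML; an editor would fetch the bucket as well and join on the id.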
For views that don't need to support editing or any other manipulation
or analysis, we can strip the HTML more aggressively without
affecting the rendering and get close to, or even below,
the PHP parser output size (there might be use cases where that is the
appropriate thing to do). I could get this down to under 1M by stripping
rel attributes, element ids, and the about ids that identify template output.
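For anyone who wants to try that kind of stripping on their own pages, here is a minimal sketch using only Python's standard html.parser. It is not the script used for the numbers above; the class and function names are hypothetical, and it simply re-serialises the markup while dropping the rel, id and about attributes. It ignores processing instructions and other corner cases.

```python
from html import escape
from html.parser import HTMLParser

# Attributes named above as droppable for read-only views.
STRIP = {"rel", "id", "about"}


class AttrStripper(HTMLParser):
    """Copy HTML through, omitting the attributes listed in `strip`."""

    def __init__(self, strip=STRIP):
        # convert_charrefs=False so entities in text pass through untouched.
        super().__init__(convert_charrefs=False)
        self.strip = set(strip)
        self.out = []

    def _open(self, tag, attrs, close=""):
        parts = [tag]
        for name, value in attrs:
            if name in self.strip:
                continue
            # Attribute values arrive unescaped, so escape them again.
            parts.append(name if value is None
                         else f'{name}="{escape(value, quote=True)}"')
        self.out.append("<" + " ".join(parts) + close + ">")

    def handle_starttag(self, tag, attrs):
        self._open(tag, attrs)

    def handle_startendtag(self, tag, attrs):
        self._open(tag, attrs, close="/")

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def strip_attrs(html_text, strip=STRIP):
    parser = AttrStripper(strip)
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)


# Made-up fragment; compare sizes as UTF-8 bytes to match file sizes on disk.
src = '<a rel="mw:WikiLink" href="./Foo" id="mwAw">Foo</a><span about="#mwt1">x &amp; y</span><br/>'
out = strip_attrs(src)
print(len(src.encode("utf-8")), "->", len(out.encode("utf-8")), "bytes")
```

Passing a larger set, for example `STRIP | {"data-mw"}`, would approximate the no-data-mw variant as well.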
But, for editing use cases (not just in VE), because of the additional
markup in place on the page (element ids, other markup for
transclusions, extensions, links, etc.), the output will probably be
somewhat larger than the corresponding PHP parser output. If we can keep
it under 1.1x of the PHP parser output size, I think we are good.
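To make the 1.1x target concrete for the Barack Obama numbers quoted earlier, here is a quick check. It assumes the no-data-mw figure is the right stand-in for the editing output once data-mw lives elsewhere, which is only one way to read the target.

```python
php_parser = 1045161          # PHP parser output size quoted earlier
parsoid_no_data_mw = 1197318  # Parsoid output with data-mw removed

budget = 1.1 * php_parser            # the 1.1x target, in the same units
over = parsoid_no_data_mw - budget   # positive means over budget

print(f"1.1x budget:      {budget:.0f}")
print(f"current size:     {parsoid_no_data_mw}")
print(f"over budget by:   {over:.0f}")
```

On this page, moving data-mw out is not quite enough by itself to reach 1.1x; the remainder would have to come from moving page properties and other metadata out as well.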
I hope we can meet in the middle :-)
Please file bugs and continue to report things that get in the way of
using Parsoid.
Subbu.
[1] https://www.mediawiki.org/wiki/Parsoid/Users
[2] http://www.mediawiki.org/wiki/Parsoid/MediaWiki_DOM_spec