On 01/08/2015 01:20, Subramanya Sastry wrote:
On 07/31/2015 12:55 PM, Ricordisamoa wrote:
Hi Subbu, thank you for this thoughtful insight.
And thank you for starting this thread. :-)
HTML is not a barrier by itself. The problem seems to be Parsoid being built primarily with VisualEditor in mind.
While we want the DOM to be VE-friendly, we definitely don't want the DOM to be VE-centric and that has been the intention from the very beginning. Flow, CX also use the Parsoid DOM for their functionality. There are other users too [1].
VE, Flow, CX all take advantage of HTML. And I can't make any sense out of editProtectedHelper.js https://en.wikipedia.org/wiki/User:Jackmcbarn/editProtectedHelper.js :'(
We definitely want Parsoid's output to be useful and usable more broadly as the canonical output representation of wikitext and are open to fixing whatever prevents that.
As Scott noted in the other email on the thread, inspired (and maybe challenged :-) ) by mwparserfromhell's utilities, he has already whipped up a layer that provides an easier interface for manipulating the DOM.
It is not clear to me how a single DOM serving both view and edit modes can avoid redundancy.
You are right that there are some redundancies in information representation (because of having to serve multiple needs), but as far as I know, it is mostly around image attributes. If there is anything else specific (beyond image attributes) that is bothering you, can you flag that?
https://www.mediawiki.org/wiki/Parsoid/MediaWiki_DOM_spec#Transclusion_conte... All template parameters are in data-mw but not parsed. Parameters ending up in the 'final' wikitext are parsed separately.
I see huge demand for alternative wikignome-style editors. The more predictable, concise and documented Parsoid's DOM is, the more users you get.
I think Parsoid's DOM is predictable :-) but, can you say more about what prompted you to say that?
For example, to find images I have to search elements where typeof is one of mw:Image, mw:Image/Thumb, mw:Image/Frame, mw:Image/Frameless, then see if it's a figure or a span, and expect either a <figcaption> or data-mw accordingly. Add that the img tag's parent can be <a> or <span>... Instead, this is what I'd expect a proper structure to look like:
Image
  .src = title, internal or external link?
  .repository?
  .page = number or null
  .language = code or null
  .format = thumb etc.
  .caption = wikitext parsed recursively
  .link = internal or external link or null
  .size
    .original
      .width = 1234
      .height = 4321
    .specified
      .width = 2468
    .computed
      .width = 2468
      .height = 8642
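To make the lookup described above concrete, here is a minimal sketch in Python using only the standard library. It assumes the typeof values and the figure/span distinction are exactly as listed in this message. The names `ImageFinder` and `find_images` are made up for illustration and are not part of Parsoid or any existing library.

```python
from html.parser import HTMLParser

# The typeof values listed in the message above.
IMAGE_TYPES = {"mw:Image", "mw:Image/Thumb", "mw:Image/Frame", "mw:Image/Frameless"}


class ImageFinder(HTMLParser):
    """Collect one small record per element whose typeof marks it as an image."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        # A valueless attribute is reported as None, hence the "or".
        typeof = dict(attrs).get("typeof") or ""
        # typeof may hold several space-separated values.
        matched = IMAGE_TYPES.intersection(typeof.split())
        if matched:
            self.images.append({
                "tag": tag,
                "format": sorted(matched)[0],
                # Assumption taken from the message: a <figure> carries a
                # <figcaption>, anything else keeps its caption in data-mw.
                "caption_in": "figcaption" if tag == "figure" else "data-mw",
            })


def find_images(html):
    """Return the image records found in a Parsoid-style HTML string."""
    finder = ImageFinder()
    finder.feed(html)
    finder.close()
    return finder.images
```

Even this small sketch has to branch on the wrapper tag before it knows where the caption lives, which is the irregularity being pointed out.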
As for documentation, we document the DOM we generate and its semantics here [2].
It seems that some sections need updates, e.g. noinclude / includeonly / onlyinclude https://www.mediawiki.org/wiki/Parsoid/MediaWiki_DOM_spec#noinclude_.2F_includeonly_.2F_onlyinclude
As for size, I just looked at the Barack Obama page and here are some size numbers.
By "concise" I meant an antonym for redundant, not lengthy :-)
1540407 /tmp/Barack_Obama.parsoid.html
1197318 /tmp/Barack_Obama.parsoid.no-data-mw.html
1045161 /tmp/Barack_Obama.php-parser.output.footer-stripped.html
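For readers who want the ratios rather than the raw byte counts, a small Python sketch follows. The three sizes are copied from the listing above. The dictionary keys and the helper name `ratio_to_php` are only illustrative labels.

```python
# Byte counts copied from the listing above.
sizes = {
    "parsoid": 1540407,
    "parsoid_no_data_mw": 1197318,
    "php_parser": 1045161,
}


def ratio_to_php(name):
    """Size of the named output relative to the PHP parser output."""
    return sizes[name] / sizes["php_parser"]


# Share of the full Parsoid output that goes away once data-mw is dropped.
data_mw_share = (sizes["parsoid"] - sizes["parsoid_no_data_mw"]) / sizes["parsoid"]

print(f"parsoid / php:            {ratio_to_php('parsoid'):.2f}")             # about 1.47
print(f"parsoid no-data-mw / php: {ratio_to_php('parsoid_no_data_mw'):.2f}")  # about 1.15
print(f"data-mw share of parsoid: {data_mw_share:.1%}")                       # about 22.3%
```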
Right now, because we inline template and other editable information (as inline JSON attributes of the DOM), the output is a bit bulky. However, we have always planned to move the data-mw attribute into its own bucket, which we might do at some point, in which case the size will be closer to the current PHP parser output. If we moved page properties and other metadata out as well, it would shrink a little bit more.
For views that don't need to support editing or any other manipulation or analysis, we can strip the HTML more aggressively without affecting the rendering
Stripping HTML altogether would be a huge step forward. :-)
and get close to, or even below, the PHP parser output size (there might be use cases where that is the appropriate thing to do). I could get this down to under 1M by stripping rel attributes, element ids, and the about ids that identify template output.
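As a rough illustration of that kind of stripping, here is a sketch in Python using only the standard library. It is not the tooling used to produce the numbers in this thread, which the thread does not show. The names `AttrStripper` and `strip_attrs` are hypothetical, and the re-serialization is deliberately simple (for instance, processing instructions are dropped).

```python
from html import escape
from html.parser import HTMLParser


class AttrStripper(HTMLParser):
    """Re-emit HTML while dropping a fixed set of attributes."""

    def __init__(self, drop=("rel", "id", "about")):
        # Keep character references as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.drop = set(drop)
        self.out = []

    def _tag(self, tag, attrs, close=""):
        parts = []
        for name, value in attrs:
            if name in self.drop:
                continue
            if value is None:
                parts.append(f" {name}")
            else:
                # Attribute values arrive unescaped, so escape them again.
                parts.append(f' {name}="{escape(value, quote=True)}"')
        return f"<{tag}{''.join(parts)}{close}>"

    def handle_starttag(self, tag, attrs):
        self.out.append(self._tag(tag, attrs))

    def handle_startendtag(self, tag, attrs):
        self.out.append(self._tag(tag, attrs, " /"))

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def strip_attrs(html, drop=("rel", "id", "about")):
    """Return html with the given attributes removed from every element."""
    stripper = AttrStripper(drop)
    stripper.feed(html)
    stripper.close()
    return "".join(stripper.out)
```

Stripping ids and about attributes this way only suits read-only views, since an editing client needs them to map content back to its source.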
But for editing use cases (not just in VE), because of the additional markup in place on the page (element ids, other markup for transclusions, extensions, links, etc.), the output will probably be somewhat larger than the corresponding PHP parser output. If we can keep it under 1.1x the PHP parser output size, I think we are good.
I hope we can meet in the middle :-)
Please file bugs and continue to report things that get in the way of using Parsoid.
Subbu.
[1] https://www.mediawiki.org/wiki/Parsoid/Users
[2] http://www.mediawiki.org/wiki/Parsoid/MediaWiki_DOM_spec
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l