Re: [Wikitech-l] I love Parsoid but it doesn't want me

1 Aug 2015


      On 01/08/2015 01:20, Subramanya Sastry wrote:
...
On 07/31/2015 12:55 PM, Ricordisamoa wrote:
...
Hi Subbu,
thank you for this thoughtful insight.
And thank you for starting this thread. :-)
...
HTML is not a barrier by itself. The problem seems to be Parsoid 
being built primarily with VisualEditor in mind.
While we want the DOM to be VE-friendly, we definitely don't want the 
DOM to be VE-centric and that has been the intention from the very 
beginning. Flow, CX also use the Parsoid DOM for their functionality. 
There are other users too [1].
VE, Flow, CX all take advantage of HTML. And I can't make any sense out 
of editProtectedHelper.js 
https://en.wikipedia.org/wiki/User:Jackmcbarn/editProtectedHelper.js :'(
...
We definitely want Parsoid's output to be useful and usable more 
broadly as the canonical output representation of wikitext and are 
open to fixing whatever prevents that.
As Scott noted in the other email on the thread, inspired (and maybe 
challenged :-) ) by mwparserfromhell's utilities, he has already 
whipped up a layer that provides an easier interface for manipulating 
the DOM.
...
It is not clear to me how a single DOM serving both view and edit 
modes can avoid redundancy.
You are right that there are some redundancies in information 
representation (because of having to serve multiple needs), but as far 
as I know, it is mostly around image attributes. If there is anything 
else specific (beyond image attributes) that is bothering you, can you 
flag that?
https://www.mediawiki.org/wiki/Parsoid/MediaWiki_DOM_spec#Transclusion_conte...
All template parameters are in data-mw but not parsed. Parameters ending 
up in the 'final' wikitext are parsed separately.
...
...
I see huge demand for alternative wikignome-style editors. The more 
predictable, concise and documented Parsoid's DOM is, the more users 
you get.
I think Parsoid's DOM is predictable :-) but, can you say more about 
what prompted you to say that?
For example, to find images I have to search elements where typeof is 
one of mw:Image, mw:Image/Thumb, mw:Image/Frame, mw:Image/Frameless, 
then see if it's a figure or a span, and expect either a <figcaption> or 
data-mw accordingly. Add that the img tag's parent can be <a> or <span>...
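To make that search concrete, here is a minimal sketch using only Python's standard library. The typeof values are the ones listed above; the class name, the function name and any markup fed to it are illustrative assumptions, not part of Parsoid or of any client library.

```python
from html.parser import HTMLParser

# typeof values named in this message
IMAGE_TYPES = {"mw:Image", "mw:Image/Thumb", "mw:Image/Frame", "mw:Image/Frameless"}


class ImageFinder(HTMLParser):
    """Collect (wrapper tag, typeof) for every image wrapper found."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        # typeof may hold several space-separated values, or be absent
        types = set((dict(attrs).get("typeof") or "").split())
        for match in sorted(types & IMAGE_TYPES):
            # the wrapper may be a <figure> or a <span>
            self.found.append((tag, match))


def find_images(html):
    """Hypothetical helper: return the image wrappers in an HTML string."""
    finder = ImageFinder()
    finder.feed(html)
    return finder.found
```

This only locates the wrappers. A caller still has to branch on the wrapper tag to know whether to look for a <figcaption> or for data-mw, and again on whether the img sits inside an <a> or a <span>.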
Instead, this is what I'd expect a proper structure to look like:
Image
.src = title, internal or external link?
.repository?
.page = number or null
.language = code or null
.format = thumb etc.
.caption = wikitext parsed recursively
.link = internal or external link or null
.size
  .original
   .width = 1234
   .height = 4321
  .specified
   .width = 2468
  .computed
   .width = 2468
   .height = 8642
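Written down as a data structure, the outline above could look roughly like this. It is only a sketch: the field names follow the outline, but the Python types are my assumptions (the outline leaves .src and .repository open), and nothing here is an existing Parsoid interface.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Dimensions:
    width: Optional[int] = None
    height: Optional[int] = None


@dataclass
class ImageSize:
    original: Dimensions   # size of the source file
    specified: Dimensions  # what the wikitext asked for
    computed: Dimensions   # what actually gets rendered


@dataclass
class Image:
    src: str                          # title, or internal/external link (assumed str)
    size: ImageSize
    repository: Optional[str] = None  # left open in the outline
    page: Optional[int] = None
    language: Optional[str] = None
    format: Optional[str] = None      # "thumb" etc.
    caption: Optional[str] = None     # stands in for recursively parsed wikitext
    link: Optional[str] = None


# Example with the numbers from the outline; the file title is made up.
example = Image(
    src="File:Example.jpg",
    format="thumb",
    size=ImageSize(
        original=Dimensions(1234, 4321),
        specified=Dimensions(width=2468),
        computed=Dimensions(2468, 8642),
    ),
)
```

The point is that every consumer would read the same fields regardless of whether the image is inline, framed or a thumbnail.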
...
As for documentation, we document the DOM we generate and its 
semantics here [2].
It seems that some sections need updates, e.g. noinclude / includeonly / 
onlyinclude 
https://www.mediawiki.org/wiki/Parsoid/MediaWiki_DOM_spec#noinclude_.2F_includeonly_.2F_onlyinclude
...
As for size, I just looked at the Barack Obama page and here are some 
size numbers.
By "concise" I meant an antonym for redundant, not lengthy :-)
...
1540407 /tmp/Barack_Obama.parsoid.html
1197318 /tmp/Barack_Obama.parsoid.no-data-mw.html
1045161 /tmp/Barack_Obama.php-parser.output.footer-stripped.html
Right now, because we inline template and other editable information 
(as inline JSON attributes of the DOM), it is a bit bulky. However, we 
have always had plans to move the data-mw attribute into its own 
bucket, which we might do at some point, in which case the size will be 
closer to the current PHP parser output. If we moved page properties 
and other metadata out, it would shrink a little bit more.
For views that don't need to support editing or any other manipulation 
or analyses, we can strip the HTML more aggressively without 
affecting the rendering
Stripping HTML altogether would be a huge step forward. :-)
...
and get close to or even shrink the size below the PHP parser output 
size (there might be use cases where that might be the appropriate thing 
to do). I could get this down to under 1M by stripping rel attributes, 
element ids, and about ids for identifying template output.
But, for editing (not just in VE) use cases, because of additional 
markup in place on the page (element ids, other markup for 
transclusions, extensions, links, etc.), the output will probably be 
somewhat larger than the corresponding PHP parser output. If we can 
keep it under 1.1x of PHP parser output size, I think we are good.
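For reference, the three Barack Obama figures quoted above can be compared against that 1.1x target with a few lines of arithmetic. The numbers are copied from the listing; the dictionary keys and the function name are made up for this sketch.

```python
# Sizes as quoted above for the Barack Obama page
sizes = {
    "parsoid": 1540407,
    "parsoid_no_data_mw": 1197318,
    "php_parser": 1045161,
}


def ratio_to_php(name):
    """Size of one output relative to the PHP parser output."""
    return sizes[name] / sizes["php_parser"]


# Fraction of the full Parsoid output that goes away with data-mw
data_mw_share = (sizes["parsoid"] - sizes["parsoid_no_data_mw"]) / sizes["parsoid"]

# The 1.1x target expressed in the same unit as the listing
target = 1.1 * sizes["php_parser"]

print(round(ratio_to_php("parsoid"), 2))             # 1.47
print(round(ratio_to_php("parsoid_no_data_mw"), 2))  # 1.15
print(round(data_mw_share, 2))                       # 0.22
print(sizes["parsoid_no_data_mw"] <= target)         # False
```

So on this page, dropping data-mw alone gives roughly 1.15x of the PHP parser output, which is still slightly above the 1.1x mark before rel attributes, element ids and about ids are stripped.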
...
I hope we can meet in the middle :-)
Please file bugs and continue to report things that get in the way of 
using Parsoid.
Subbu.
[1] https://www.mediawiki.org/wiki/Parsoid/Users
[2] http://www.mediawiki.org/wiki/Parsoid/MediaWiki_DOM_spec

Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
