RFC: Parsoid roadmap


Gabriel Wicke

24 Jan 2013

12:10 a.m.

Fellow MediaWiki hackers!

After the pretty successful December release and some more clean-up work following up on that we are now considering the next steps for Parsoid. To this end, we have put together a rough roadmap for the Parsoid project at

https://www.mediawiki.org/wiki/Parsoid/Roadmap

The main areas we plan to work on in the next months are:

Performance improvements: Loading a large wiki page through Parsoid into VisualEditor can currently take over 30 seconds. We want to make this instantaneous by generating and storing the HTML after each edit. This requires a throughput that can keep up with the edit rates on major wikipedias (~10 Hz on enwiki).

Features and refinement: Localization support will enable the use of Parsoid on non-English wikipedias. VisualEditor needs editing support for more content elements including template parameters and extension tags. As usual, we will also continue to refine Parsoid's compatibility in round-trip testing and parserTests.

Apart from these main tasks closely connected to supporting the VisualEditor, we also need to look at the longer-term Parsoid and MediaWiki strategy. Better support for visual editing and smarter caching in MediaWiki's templating facilities is one area we plan to look at. We also would like to make it easy to use the VisualEditor on small mediawiki installations by removing the need to run a separate Parsoid service.

A general theme is pushing some of Parsoid's innovations back into MediaWiki core. The clean and information-rich HTML-based content model in particular opens up several attractive options which are discussed in detail in the roadmap.

Please review the roadmap and let us know what you think!

Gabriel and the Parsoid team

--
Gabriel Wicke
Senior Software Engineer
Wikimedia Foundation
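As a rough illustration of the throughput requirement above: by Little's law, the average number of busy parse workers is the edit rate multiplied by the mean parse time. The sketch below works that out for the ~10 Hz enwiki figure. The mean parse times and the 50% headroom are made-up inputs, not measurements, and the 30-second figure in the message applies to large pages rather than to the average page.

```python
import math


def workers_needed(edit_rate_hz: float, mean_parse_seconds: float,
                   headroom: float = 0.5) -> int:
    """Estimate how many parallel parse workers keep up with the edit rate.

    Little's law: average busy workers = arrival rate * mean service time.
    `headroom` is the fraction of capacity kept free for bursts (assumed).
    """
    if not 0.0 <= headroom < 1.0:
        raise ValueError("headroom must be in [0, 1)")
    busy = edit_rate_hz * mean_parse_seconds
    return math.ceil(busy / (1.0 - headroom))


if __name__ == "__main__":
    # Hypothetical mean parse times in seconds; only the 10 Hz rate
    # comes from the message above.
    for seconds in (0.5, 2.0, 30.0):
        print(f"{seconds:>5} s/page -> {workers_needed(10.0, seconds)} workers")
```

The point of the estimate is only that the worker count scales linearly with mean parse time, which is why reducing per-edit parse work matters as much as adding hardware.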


Aaron Schulz

30 Jan 2013

1:17 a.m.

+1

I think everything into Q3 looks like a good way to proceed forward. There might be an interesting division of labor on getting these things done (Parsoid job handling, Cite extension rewrite, API batching). I'd be willing to help in areas I'd be useful in. I think this is ambitious, but the steps laid out look manageable by themselves. We will see how the target dates collide with reality, which may also depend on the level of interest.

I'd really like to see a reduction of CPU spent on refreshLinks jobs, so anything to help in that area is welcome. We currently rely on throwing more processes and hardware at the problem and using de-duplication to at least stop jobs from piling up (such as when heavily used templates keep getting edited before the previous jobs finish). De-duplication has its own costs, and it will make sense to move the queue off the main clusters. Managing these jobs is getting more difficult. In fact, it's the editing of a few templates that can account for a majority of the queue, where tens of thousands of entire pages are parsed because of some modest template change. I like the idea of storing dependency information in (or alongside) the HTML as metadata and using it to recompute only affected parts of the DOM.

There is certainly discussion to be had about the cleanest way to handle the trade-offs of when to store updated HTML for a revision (when a template/file changes or a magic word or DPL list should be re-calculated). It probably will not make sense for old revisions of pages. If we are storing new versions of HTML, it may make sense to purge the old ones from external storage if updates are frequent, though that interface has no deletion support and that is slightly against the philosophy of the external storage classes. It's probably not a big deal to change it though.

I've also been told that the HTML tends to compress well, so we should not be looking at an order-of-magnitude text storage requirement increase (though maybe 4X or so from some quick tests). I'd like to see some documented statistics on this though, with samples.

I think the Visual Editor + HTML only method for third parties is interesting and could probably make use of ContentHandler well. I'm curious about the exact nature of HTML validation needed server-side for this setup, but from what I understand it would not be too complicated and the metadata could be handled in a way that does not require blind trust of the client.

--
View this message in context: http://wikimedia.7.n6.nabble.com/RFC-Parsoid-roadmap-tp4994503p4994870.html
Sent from the Wikipedia Developers mailing list archive at Nabble.com.
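To make the de-duplication point concrete, here is a minimal sketch of a queue that refuses to enqueue a job while an identical one is still pending. It is not MediaWiki's job queue: the class, the `(job_type, page_title)` key and the page names are assumptions chosen for illustration only.

```python
from collections import OrderedDict


class DedupQueue:
    """FIFO queue that keeps at most one pending job per (type, page) key."""

    def __init__(self):
        self._pending = OrderedDict()  # (job_type, page_title) -> params

    def push(self, job_type, page_title, params=None):
        """Enqueue a job; return False if an identical job is already pending."""
        key = (job_type, page_title)
        if key in self._pending:
            return False
        self._pending[key] = params or {}
        return True

    def pop(self):
        """Remove and return the oldest pending job as (key, params)."""
        return self._pending.popitem(last=False)

    def __len__(self):
        return len(self._pending)


if __name__ == "__main__":
    queue = DedupQueue()
    pages_using_template = ["Page A", "Page B", "Page C"]  # hypothetical
    # The same template is edited three times before any job has run.
    for _ in range(3):
        for title in pages_using_template:
            queue.push("refreshLinks", title)
    print(len(queue))  # one job per page instead of one per page per edit
```

The membership check on every push is the cost mentioned above: it saves worker CPU by spending queue-side lookups, which is one reason to keep the queue off the main database clusters.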

Gabriel Wicke

6:32 p.m.

Hey Aaron, thanks for your thoughts! You evidently kicked off the discussion ;)

...

> There might be an interesting division of labor on getting these things done (Parsoid job handling, Cite extension rewrite, API batching). I'd be willing to help in areas I'd be useful in.

Awesome!

...

> I think this is ambitious, but the steps laid out look manageable by themselves. We will see how the target dates collide with reality, which may also depend on the level of interest.

Indeed. We have hard deadlines for the features needed by the VisualEditor, so the architectural work might be slowed down a bit if that gets tight. Conversion to HTML on save and HTML storage are important for user-perceived editing performance though, so it is fairly high priority.

...

> There is certainly discussion to be had about the cleanest way to handle the trade-offs of when to store updated HTML for a revision (when a template/file changes or a magic word or DPL list should be re-calculated). It probably will not make sense for old revisions of pages. If we are storing new versions of HTML, it may make sense to purge the old ones from external storage if updates are frequent, though that interface has no deletion support and that is slightly against the philosophy of the external storage classes. It's probably not a big deal to change it though. I've also been told that the HTML tends to compress well, so we should not be looking at an order-of-magnitude text storage requirement increase (though maybe 4X or so from some quick tests). I'd like to see some documented statistics on this though, with samples.

We will definitely gather some statistics on this, and will discuss the storage strategy before starting implementation. Right now we are still researching the implementation options, and should have a clearer picture next week.
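One way such statistics could be collected is to compare raw and compressed sizes of the wikitext and the HTML for the same revision. The sketch below uses zlib from the Python standard library as a stand-in for whatever compression the text storage actually applies. The sample strings are invented placeholders, so the printed ratios say nothing about real pages.

```python
import zlib


def stored_size(text: str, level: int = 6) -> int:
    """Size in bytes of the UTF-8 text after zlib compression."""
    return len(zlib.compress(text.encode("utf-8"), level))


def size_report(wikitext: str, html: str) -> dict:
    """Raw and compressed sizes for one revision, plus HTML/wikitext ratios."""
    wt_raw = len(wikitext.encode("utf-8"))
    html_raw = len(html.encode("utf-8"))
    wt_comp = stored_size(wikitext)
    html_comp = stored_size(html)
    return {
        "wikitext_raw": wt_raw,
        "html_raw": html_raw,
        "wikitext_compressed": wt_comp,
        "html_compressed": html_comp,
        "raw_ratio": html_raw / wt_raw,
        "compressed_ratio": html_comp / wt_comp,
    }


if __name__ == "__main__":
    # Placeholder sample, not taken from any real page.
    sample_wikitext = "''emphasis'' and [[Link]]\n" * 40
    sample_html = '<p><i>emphasis</i> and <a rel="mw:WikiLink" href="./Link">Link</a></p>\n' * 40
    for name, value in size_report(sample_wikitext, sample_html).items():
        print(f"{name}: {value}")
```

Running this over a sample of real revisions, grouped by page size, would give the documented numbers asked for above.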

...

> I think the Visual Editor + HTML only method for third parties is interesting and could probably make use of ContentHandler well.

The ContentHandler angle is something I have also been wondering about. For pure HTML wikis this should work as designed, with a single (HTML/RDFa) content model assigned per revision. For mixed wikis storing both HTML and wikitext, however, we need to support different content models (wikitext and HTML/RDFa) for each revision. Those two are isomorphic, but are handled differently. If there is interest in supporting multiple content models per revision within the ContentHandler framework, then now would probably be a good time to work that out.

In any case, it seems to be a good idea to use the existing text storage logic in revision, including its support for compression and external storage.

...

> I'm curious about the exact nature of HTML validation needed server-side for this setup, but from what I understand it would not be too complicated and the metadata could be handled in a way that does not require blind trust of the client.

Currently Parsoid converts each edited HTML document to wikitext, and then re-parses that wikitext while sanitizing attributes and tags with a port of the PHP Sanitizer class. Before we can store the HTML DOM edited by a client directly, we will need to rework sanitization to work on the DOM, and preferably also perform as much of the work as possible on the way in instead of on the way out.

Metadata embedded in the DOM beyond regular HTML can be divided into two categories: public RDFa-based structures and private round-trip data. Public RDFa structures will need more solid verification, but are otherwise pretty straightforward (see the spec at [1]). We plan to move private round-trip data out of the DOM, which would prevent clients from messing with it. We will probably use some unique id attributes to aid the association of nodes with their metadata, but might also be able to get away without such ids by using subtree hashing similar to that described in XyDiff [2]. XyDiff would also be an improvement over the simplistic DOM diff algorithm we currently use for change detection.

Gabriel

[1]: http://www.mediawiki.org/wiki/Parsoid/MediaWiki_DOM_spec
[2]: http://gregory.cobena.free.fr/www/Publications/%5BICDE2002%5D%20XyDiff%20-%…

--
Gabriel Wicke
Senior Software Engineer
Wikimedia Foundation
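The subtree-hashing idea can be sketched as follows: hash every subtree bottom-up, then treat subtrees whose hash appears both before and after an edit as unchanged, so that privately stored metadata can be re-attached to them without ids in the DOM. This is only the signature step, loosely in the spirit of XyDiff. It is not the XyDiff algorithm and not Parsoid's code, and it uses Python's XML parser on well-formed input as a simplification.

```python
import hashlib
import xml.etree.ElementTree as ET


def subtree_hash(node, table):
    """Hash a node from its tag, attributes, text and child hashes.

    Every digest is recorded in `table` (digest -> list of nodes) so that
    callers can look up subtrees by content.
    """
    h = hashlib.sha1()
    h.update(node.tag.encode("utf-8"))
    for key, value in sorted(node.attrib.items()):
        h.update(f"\0attr:{key}={value}".encode("utf-8"))
    h.update(b"\0text:" + (node.text or "").encode("utf-8"))
    for child in node:
        h.update(b"\0child:" + subtree_hash(child, table).encode("ascii"))
        h.update(b"\0tail:" + (child.tail or "").encode("utf-8"))
    digest = h.hexdigest()
    table.setdefault(digest, []).append(node)
    return digest


def unchanged_subtrees(old_xml, new_xml):
    """Map digest -> nodes of the new document that also occur in the old one."""
    old_table, new_table = {}, {}
    subtree_hash(ET.fromstring(old_xml), old_table)
    subtree_hash(ET.fromstring(new_xml), new_table)
    return {d: new_table[d] for d in old_table if d in new_table}


if __name__ == "__main__":
    before = "<body><p>Hello <b>world</b></p><p>Second</p></body>"
    after = "<body><p>Hello <b>world</b></p><p>Second, edited</p></body>"
    matches = unchanged_subtrees(before, after)
    print(sorted(nodes[0].tag for nodes in matches.values()))
```

Identical subtrees occurring more than once share a digest, so a real implementation would need a tie-breaking rule (position, parent match) before re-attaching metadata.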

Ariel T. Glenn

7:36 a.m.

On 23-01-2013, Wed, at 15:10 -0800, Gabriel Wicke wrote:

...

One thing that jumped out at me is this:

"We have also decided to narrow our focus a bit by continuing to use the PHP preprocessor to perform our template expansion."

While I understand the reasoning and even sympathize with it, I had hoped that Parsoid, when complete, would facilitate the implementation of wikitext parsers apart from the canonical parser (i.e. MediaWiki), with clearly defined behavior for the language including templates. Is that idea dead then?

Ariel

Subramanya Sastry

5:39 p.m.

On 01/30/2013 12:36 AM, Ariel T. Glenn wrote:

...

On 23-01-2013, Wed, at 15:10 -0800, Gabriel Wicke wrote:

As it exists, Parsoid can tackle full template expansion -- but, since it does not support all parser functions natively, this is still incomplete, and we can bypass the need for the most part by relying on the PHP preprocessor to give us fully expanded wikitext which we process further.

We are refocusing our efforts towards exploring HTML-based templating -- while supporting existing templates. Lua-based templates already clean up a lot of template logic by having access to full conditional logic. By relying more on DOM-based templates (which would also be editable in a VisualEditor-like client), the expectation is that direct wikitext use itself will progressively diminish. Since most wikitext and probably Lua templates already return well-formed DOM (not all do), by simply adding a parse layer on top of them, they can be supported in a DOM-only templating framework. So, the first outcome of this effort would be to require templates to always return a DOM fragment.

In such a diminished-use scenario, we do not see the need for focusing a lot of energy and effort on attaining full compatibility entirely in Parsoid. We see Parsoid+PHP parser as providing legacy wikitext support while a large chunk of editing and storage happens in the HTML world. We can then take it from there based on how far this strategy takes us. If there still remains a need for a full replacement wikitext evaluation system to be in place (because of continuing popularity of wikitext or because of performance reasons or whatever else), that option remains open and is not closed at this time.

Even so, there is still the possibility of identifying "erroneous" or "undefined behavior" wikitext markup within Parsoid (in quotes, because anything that is thrown at the PHP parser and Parsoid always needs to be rendered). We can detect, for example, missing opening/closing HTML tags (since we currently have to do that for round-tripping them properly without introducing dirty diffs), detect unbalanced tags in certain contexts by treating them as balanced-DOM contexts (image captions, extensions), and other such scenarios. We also have been adding a number of parser tests that try to specify edge case behaviors and make a call as to whether each is legitimate behavior or undefined behavior. All of this could be used to issue warnings in a lint-like mode, which can then serve as a de facto definition of legitimate wikitext, since there is no possibility of a grammar-based definition for wikitext.

So, while we are not focusing on attaining full replacement capability in Parsoid, our new directions do not entirely do away with the idea that you alluded to:

(1) we are attempting to move towards templates (DOM/Lua/wikitext) that can only return DOM fragments

(2) we retain the ability to provide some kind of linting ability in Parsoid (but this functionality is not at the top of our todo list since we are focused on reducing the scope of wikitext use over the long term, while providing full compatibility in the immediate and short term).

Does that answer your question?

Subbu.

PS: The other primary reason for going with a new wikitext evaluator/runtime (more accurate than calling this a parser), possibly in C++, was performance -- but we are going at it in a different way already, based on the notion that most edits on wiki pages are going to be "minor" edits (relative to the size of the page). If so, there is no sense in fully serializing it and fully reparsing it on every such minor edit -- it is a waste of server resources. Since we now have a fully RT-able HTML representation of wikitext, selective serialization (HTML->wikitext), selective reparsing (of wikitext-based edits that happen outside the VE), along with caching of DOM fragments (transclusions, etc.) should take care of the performance issue -- these are addressed in the RFC.
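The lint-like mode described above could, for the narrow case of literal HTML tags, look roughly like the following. This is a sketch built on Python's standard `html.parser`, not Parsoid's detection logic: it only looks at HTML tags, knows nothing about wikitext syntax, and the void-element list and warning wording are assumptions.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag (partial list, assumed sufficient here).
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "param", "source", "track", "wbr"}


class TagBalanceLinter(HTMLParser):
    """Collect warnings for unclosed and stray HTML tags in a fragment."""

    def __init__(self):
        super().__init__()
        self.stack = []      # (tag, (line, column)) of currently open tags
        self.warnings = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID:
            self.stack.append((tag, self.getpos()))

    def handle_endtag(self, tag):
        if tag in VOID:
            return
        # Find the nearest matching open tag; anything opened after it
        # was never closed.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                for unclosed, (line, col) in self.stack[i + 1:]:
                    self.warnings.append(
                        f"missing </{unclosed}> for tag opened at {line}:{col}")
                del self.stack[i:]
                return
        line, col = self.getpos()
        self.warnings.append(f"stray </{tag}> at {line}:{col}")


def lint_fragment(fragment):
    """Return a list of warning strings; an empty list means balanced."""
    linter = TagBalanceLinter()
    linter.feed(fragment)
    linter.close()
    for tag, (line, col) in linter.stack:
        linter.warnings.append(f"missing </{tag}> for tag opened at {line}:{col}")
    return linter.warnings


if __name__ == "__main__":
    for sample in ("<i>ok</i>", "<div><b>bold</div>", "caption text</span>"):
        print(repr(sample), "->", lint_fragment(sample))
```

Treating a context such as an image caption as a balanced-DOM context then amounts to running the check on that context alone and reporting anything that is not closed within it.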

Brad Jorsch

6:12 p.m.

On Wed, Jan 30, 2013 at 11:39 AM, Subramanya Sastry <ssastry(a)wikimedia.org> wrote:

...

> (1) we are attempting to move towards templates (DOM/Lua/wikitext) that can only return DOM fragments

You may want to run that idea past the communities first. I imagine they'll get upset if you break widely-used templates like just about anything in [[en:Category:Archival templates]] whose name ends with "top" or "bottom", considering that the obvious "fix" quickly runs into various parser limits (e.g. Template argument size).

Gabriel Wicke

6:44 p.m.

On 01/30/2013 09:12 AM, Brad Jorsch wrote:

...

On Wed, Jan 30, 2013 at 11:39 AM, Subramanya Sastry <ssastry(a)wikimedia.org> wrote:

> (1) we are attempting to move towards templates (DOM/Lua/wikitext) that can only return DOM fragments

Brad,

we still support unbalanced templates, but plan to enforce nesting of DOM blocks made up of combinations of these templates. Currently a table produced by unbalanced table start / row / end templates is encapsulated as a single template-affected DOM fragment. We plan to enforce the nesting of such a compound fragment, but not of individual templates that are part of it. This means that these unbalanced templates continue to work as expected, but can be re-expanded as a compound in a well-defined branch of the page DOM.

For the vast majority of templates that do not emit unbalanced output, however, we can directly enforce proper nesting per template without breaking current behavior.

Gabriel

--
Gabriel Wicke
Senior Software Engineer
Wikimedia Foundation
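A toy model of the compound-fragment idea: walk the template expansions in page order, track the running tag depth, and close a group whenever the depth returns to zero. Balanced templates end up in a group of their own, while table start / row / end templates are grouped into one compound. The template names and outputs are invented, only tags are counted (tag names are not matched), and content between templates is ignored, so this is far simpler than what Parsoid would need.

```python
import re

# Matches an opening, closing or self-closing tag and captures its parts.
TAG = re.compile(r"<(/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*?(/?)>")
VOID = {"br", "hr", "img"}  # partial list, enough for the example


def depth_delta(html):
    """Net number of tags left open (+) or closed without opener (-)."""
    delta = 0
    for closing, name, self_closing in TAG.findall(html):
        if name.lower() in VOID or self_closing:
            continue
        delta += -1 if closing else 1
    return delta


def group_compounds(expansions):
    """Group consecutive (template_name, html) pairs into balanced compounds."""
    groups, current, depth = [], [], 0
    for name, html in expansions:
        current.append(name)
        depth += depth_delta(html)
        if depth <= 0:          # balanced again: the compound is complete
            groups.append(current)
            current, depth = [], 0
    if current:                 # still open at the end of the page
        groups.append(current)
    return groups


if __name__ == "__main__":
    page = [  # hypothetical templates and their expansions
        ("infobox", "<div>info</div>"),
        ("table-start", "<table>"),
        ("table-row", "<tr><td>a</td></tr>"),
        ("table-end", "</table>"),
        ("cite", "<span>ref</span>"),
    ]
    print(group_compounds(page))
```

Each resulting group is a unit whose nesting can be enforced and which can be re-expanded on its own, which matches the distinction drawn above between compounds and their member templates.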

Denny Vrandečić

12:47 p.m.

Thank you for the Roadmap, Gabriel! There is some exciting and interesting stuff in it.

I am really happy that the roadmap would allow us this year to highly optimize Wikidata-related changes on the Wikipedias, i.e. we would not need to reparse the whole page when some data in Wikidata changes, and could thus possibly afford to keep all language editions more current. That would be awesome -- until now I was always assuming that the Wikipedia articles would only be updated on the next purge, whenever that happens. We could optimize that based on the work of your team.

Thanks!

Cheers,
Denny

2013/1/24 Gabriel Wicke <gwicke(a)wikimedia.org>

...

--
Project director Wikidata
Wikimedia Deutschland e.V. | Obentrautstr. 72 | 10963 Berlin
Tel. +49-30-219 158 26-0 | http://wikimedia.de

Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e.V. Registered in the register of associations of the Amtsgericht Berlin-Charlottenburg under number 23855 B. Recognized as a charitable organization by the Finanzamt für Körperschaften I Berlin, tax number 27/681/51985.
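The partial-update idea raised in this thread (dependency information stored alongside the HTML, so that a data change re-renders only the affected parts) can be sketched like this. The fragment ids, dependency keys and values are hypothetical placeholders, and the render callbacks stand in for whatever would actually produce the DOM fragment.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Fragment:
    fragment_id: str
    deps: frozenset                  # keys this fragment's output depends on
    render: Callable[[dict], str]    # produces the fragment's HTML from data
    html: str = ""


class CachedPage:
    """Page stored as rendered fragments plus their dependency sets."""

    def __init__(self, fragments, data):
        self.fragments = list(fragments)
        self.render_count = 0
        for fragment in self.fragments:
            self._rerender(fragment, data)

    def _rerender(self, fragment, data):
        fragment.html = fragment.render(data)
        self.render_count += 1

    def on_change(self, changed_key, data):
        """Re-render only fragments depending on `changed_key`; return their ids."""
        touched = [f for f in self.fragments if changed_key in f.deps]
        for fragment in touched:
            self._rerender(fragment, data)
        return [f.fragment_id for f in touched]

    def to_html(self):
        return "".join(f.html for f in self.fragments)


if __name__ == "__main__":
    data = {"item:Q1:population": "100"}  # placeholder key and value
    page = CachedPage([
        Fragment("intro", frozenset(), lambda d: "<p>Intro</p>"),
        Fragment("infobox", frozenset({"item:Q1:population"}),
                 lambda d: f"<div>Population: {d['item:Q1:population']}</div>"),
        Fragment("body", frozenset(), lambda d: "<p>Body</p>"),
    ], data)

    data["item:Q1:population"] = "101"
    print(page.on_change("item:Q1:population", data))
    print(page.to_html())
```

With such an index, a change to one data item costs one fragment render per affected page instead of a full reparse of each page.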


wikitech-l@lists.wikimedia.org
