Hey Aaron,
thanks for your thoughts! You evidently kicked off the discussion ;)
There might be an interesting division of labor on getting these things done (Parsoid job handling, Cite extension rewrite, API batching). I'd be willing to help in areas I'd be useful in.
Awesome!
I think this is ambitious, but the steps laid out look manageable by themselves. We will see how the target dates collide with reality, which may also depend on the level of interest.
Indeed. We have hard deadlines for the features needed by the VisualEditor, so the architectural work might be slowed down a bit if that gets tight. Conversion to HTML on save and HTML storage are important for user-perceived editing performance though, so it is fairly high priority.
There is certainly discussion to be had about the cleanest way to handle the trade-offs of when to store updated HTML for a revision (when a template/file changes or a magic word or DPL list should be re-calculated). It probably will not make sense for old revisions of pages. If we are storing new versions of HTML, it may make sense to purge the old ones from external storage if updates are frequent, though that interface has no deletion support and that is slightly against the philosophy of the external storage classes. It's probably not a big deal to change it though. I've also been told that the HTML tends to compress well, so we should not be looking at an order-of-magnitude increase in text storage requirements (though maybe 4X or so from some quick tests). I'd like to see some documented statistics on this though, with samples.
We will definitely do some statistics on this, and will discuss the storage strategy before starting implementation. Right now we are still researching the implementation options, and should have a clearer picture next week.
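As a rough idea of what such a measurement could look like, here is a minimal sketch in Python. It only compares raw and compressed sizes of a wikitext revision and its HTML rendering. The function name, the sample strings, and the use of zlib as a stand-in for whatever compression the text storage actually applies are all assumptions for illustration; real numbers would have to come from a sample of actual page revisions.

```python
import zlib


def compression_stats(wikitext: str, html: str, level: int = 6) -> dict:
    """Compare raw and zlib-compressed byte sizes of two renderings of one revision."""
    wt_raw = wikitext.encode("utf-8")
    html_raw = html.encode("utf-8")
    wt_z = zlib.compress(wt_raw, level)
    html_z = zlib.compress(html_raw, level)
    return {
        "wikitext_raw": len(wt_raw),
        "html_raw": len(html_raw),
        "wikitext_compressed": len(wt_z),
        "html_compressed": len(html_z),
        # How much larger the HTML is than the wikitext, before and after compression.
        "raw_ratio": len(html_raw) / len(wt_raw),
        "compressed_ratio": len(html_z) / len(wt_z),
    }


# Made-up, highly repetitive sample; not representative of real pages.
sample_wikitext = "'''Foo''' is a [[bar]].\n" * 50
sample_html = '<p><b>Foo</b> is a <a rel="mw:WikiLink" href="./Bar">bar</a>.</p>\n' * 50

stats = compression_stats(sample_wikitext, sample_html)
for key, value in stats.items():
    print(key, value)
```

The interesting figure is compressed_ratio across a representative set of revisions, since that is what would drive the storage requirement.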
I think the Visual Editor + HTML only method for third parties is interesting and could probably make use of ContentHandler well.
The ContentHandler angle is something I have also been wondering about. For pure HTML wikis this should work as designed, with a single (HTML/RDFa) content model assigned per revision. For mixed wikis storing both HTML and wikitext, however, we need to support different content models (wikitext and HTML/RDFa) for each revision. The two are isomorphic, but are handled differently. If there is interest in supporting multiple content models per revision within the ContentHandler framework, then now would probably be a good time to work that out.
In any case, it seems to be a good idea to use the existing text storage logic in revision including its support for compression and external storage.
I'm curious about the exact nature of HTML validation needed server-side for this setup, but from what I understand it would not be too complicated and the metadata could be handled in a way that does not require blind trust of the client.
Currently Parsoid converts each edited HTML document to wikitext, and then re-parses that wikitext while sanitizing attributes and tags with a port of the PHP Sanitizer class.
Before we can store the HTML DOM edited by a client directly, we will need to rework sanitization to work on the DOM, and preferably also perform as much of the work as possible on the way in instead of on the way out.
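To make the idea of sanitizing directly on the DOM a bit more concrete, here is a small sketch in Python using the standard library's ElementTree. The whitelists are invented for this example and are not the lists used by the PHP Sanitizer class or its Parsoid port; the function names are hypothetical as well. It assumes a well-formed XML serialization of the DOM and does not check the root element itself.

```python
import xml.etree.ElementTree as ET

# Hypothetical whitelists, for illustration only.
ALLOWED_TAGS = {"div", "p", "b", "i", "a", "span"}
ALLOWED_ATTRS = {"href", "rel", "typeof", "about", "property", "id"}


def _remove_keeping_tail(parent, child):
    """Remove child (and its subtree) but keep the text that follows it."""
    idx = list(parent).index(child)
    tail = child.tail or ""
    if idx == 0:
        parent.text = (parent.text or "") + tail
    else:
        prev = parent[idx - 1]
        prev.tail = (prev.tail or "") + tail
    parent.remove(child)


def sanitize(node):
    """Strip disallowed attributes and disallowed child elements, in place."""
    for name in list(node.attrib):
        if name not in ALLOWED_ATTRS:
            del node.attrib[name]
    # Drop script URLs even though the attribute name itself is allowed.
    href = node.attrib.get("href", "")
    if href.strip().lower().startswith("javascript:"):
        del node.attrib["href"]
    for child in list(node):
        if child.tag in ALLOWED_TAGS:
            sanitize(child)
        else:
            _remove_keeping_tail(node, child)


root = ET.fromstring(
    '<div><p onclick="x()">Hi <script>alert(1)</script>there '
    '<a href="javascript:evil()" rel="mw:WikiLink">link</a></p></div>'
)
sanitize(root)
out = ET.tostring(root, encoding="unicode")
print(out)
```

Working on the tree like this means the check happens once when the edit comes in, rather than through a round trip via wikitext.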
Metadata embedded in the DOM beyond regular HTML can be divided into two categories: public RDFa-based structures and private round-trip data. Public RDFa structures will need more solid verification, but are otherwise pretty straightforward (see the spec at [1]). We plan to move private round-trip data out of the DOM, which would prevent clients from messing with it. We will probably use some unique id attributes to aid the association of nodes with their metadata, but might also be able to get away without such ids by using subtree hashing similar to that described in XyDiff [2]. XyDiff would also be an improvement over the simplistic DOM diff algorithm we currently use for change detection.
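As an illustration of the subtree hashing idea, here is a minimal bottom-up hashing sketch in Python. It is only loosely in the spirit of the signatures described in the XyDiff paper and is not that algorithm; the function name and the sample documents are made up. Each node's digest covers its tag, attributes, text and the digests of its children, so an unchanged subtree keeps the same digest after an edit elsewhere, and externally stored metadata could in principle be keyed by digest instead of by id attribute.

```python
import hashlib
import xml.etree.ElementTree as ET


def subtree_hashes(node, table=None):
    """Return (digest, table); table maps each subtree digest to the nodes having it."""
    if table is None:
        table = {}
    h = hashlib.sha1()
    h.update(node.tag.encode("utf-8"))
    # Sort attributes so that attribute order does not affect the digest.
    for name, value in sorted(node.attrib.items()):
        h.update(b"\0" + name.encode("utf-8") + b"=" + value.encode("utf-8"))
    h.update(b"\1" + (node.text or "").encode("utf-8"))
    for child in node:
        child_digest, _ = subtree_hashes(child, table)
        h.update(b"\2" + child_digest.encode("ascii"))
        # Text following a child belongs to the parent's content.
        h.update(b"\3" + (child.tail or "").encode("utf-8"))
    digest = h.hexdigest()
    table.setdefault(digest, []).append(node)
    return digest, table


old = ET.fromstring("<body><p>Unchanged</p><p>Old <b>text</b></p></body>")
new = ET.fromstring("<body><p>Unchanged</p><p>New <b>text</b></p></body>")

old_root, old_table = subtree_hashes(old)
new_root, new_table = subtree_hashes(new)

# Digests present in both versions identify subtrees the edit did not touch.
unchanged = set(old_table) & set(new_table)
print(len(unchanged), "unchanged subtrees")
```

Identical subtrees in different places share a digest, which is why the table maps to a list of nodes; a real implementation would need a way to disambiguate those.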
Gabriel
[1]: http://www.mediawiki.org/wiki/Parsoid/MediaWiki_DOM_spec
[2]: http://gregory.cobena.free.fr/www/Publications/%5BICDE2002%5D%20XyDiff%20-%2...