Hey Aaron,
thanks for your thoughts! You evidently kicked off the discussion ;)
There might be an interesting division of labor on getting these things done
(Parsoid job handling, Cite extension rewrite, API batching). I'd be willing
to help in areas I'd be useful in.
Awesome!
I think this is ambitious, but the steps
laid out look manageable by themselves. We will see how the target dates
collide with reality, which may also depend on the level of interest.
Indeed. We have hard deadlines for the features needed by the
VisualEditor, so the architectural work might be slowed down a bit if
that gets tight. Conversion to HTML on save and HTML storage are
important for user-perceived editing performance though, so it is fairly
high priority.
There is certainly discussion to be had about the cleanest way to handle the
trade-offs of when to store updated HTML for a revision (when a
template/file changes or a magic word or DPL list should be re-calculated).
It probably will not make sense for old revisions of pages. If we are
storing new versions of HTML, it may make sense to purge the old ones from
external storage if updates are frequent, though that interface has no
deletion support and that is slightly against the philosophy of the external
storage classes. It's probably not a big deal to change it though. I've also
been told that the HTML tends to compress well, so we should not be looking
at an order-of-magnitude text storage requirement increase (though maybe 4X
or so from some quick tests). I'd like to see some documented statistics on
this though, with samples.
We will definitely do some statistics on this, and will discuss the
storage strategy before starting implementation. Right now we are still
researching the implementation options, and should know more next week.
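
As a rough idea of how such statistics could be collected, here is a minimal
sketch that compares raw and zlib-compressed sizes for the wikitext and HTML
of a revision. The function name, the compression level and the sample
strings are all made up for illustration; real numbers would have to come
from actual page samples and from whatever compression the text storage
really uses.

```python
import zlib


def compressed_ratio(wikitext: str, html: str, level: int = 6) -> dict:
    """Compare raw and zlib-compressed sizes of two renderings of one revision."""
    wt_raw = wikitext.encode("utf-8")
    html_raw = html.encode("utf-8")
    wt_z = zlib.compress(wt_raw, level)
    html_z = zlib.compress(html_raw, level)
    return {
        # How much bigger the HTML is before compression
        "raw_ratio": len(html_raw) / len(wt_raw),
        # How much bigger the HTML is after compression
        "compressed_ratio": len(html_z) / len(wt_z),
    }


# Invented sample pair, not a real measurement
sample_wikitext = "'''Foo''' is a [[bar]].\n" * 50
sample_html = '<p><b>Foo</b> is a <a href="./Bar">bar</a>.</p>\n' * 50
print(compressed_ratio(sample_wikitext, sample_html))
```

Running this over a representative sample of pages, rather than one
synthetic string, would give the kind of documented numbers asked for above.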
I think the Visual Editor + HTML only method for third parties is
interesting and could probably make use of ContentHandler well.
The ContentHandler angle is something I have also been wondering about.
For pure HTML wikis this should work as designed, with a single
(HTML/RDFa) content model assigned per revision. For mixed wikis storing
both HTML and wikitext, however, we need to support different content
models (wikitext and HTML/RDFa) for each revision. Those two are
isomorphic, but are handled differently. If there is interest in
supporting multiple content models per revision within the
ContentHandler framework, then now would probably be a good time to work
that out.
In any case, it seems to be a good idea to use the existing text storage
logic in revision including its support for compression and external
storage.
I'm curious about the exact nature of HTML validation needed server-side for
this setup, but from what I understand it would not be too complicated and
the metadata could be handled in a way that does not require blind trust of
the client.
Currently Parsoid converts each edited HTML document to wikitext, and
then re-parses that wikitext while sanitizing attributes and tags with a
port of the PHP Sanitizer class.
Before we can store the HTML DOM edited by a client directly, we will
need to rework sanitization to work on the DOM, and preferably also
do as much of the work as possible on the way in rather than on the way out.
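
To illustrate what sanitizing directly on the DOM on the way in could look
like, here is a minimal sketch using Python's standard ElementTree on a
well-formed fragment. The allowlists and the function name are hypothetical
and far smaller than what the PHP Sanitizer actually enforces; attribute
values (for example href schemes) are not checked at all here, and the root
element itself is assumed to be trusted.

```python
import xml.etree.ElementTree as ET

# Hypothetical allowlists for illustration only
ALLOWED_TAGS = {"div", "p", "b", "i", "a", "span"}
ALLOWED_ATTRS = {"a": {"href", "rel"}, "span": {"typeof", "about"}}


def sanitize(elem):
    """Strip disallowed attributes and drop disallowed child subtrees, in place."""
    allowed = ALLOWED_ATTRS.get(elem.tag, set())
    for name in list(elem.attrib):
        if name not in allowed:
            del elem.attrib[name]

    prev = None  # last child that was kept
    for child in list(elem):
        if child.tag in ALLOWED_TAGS:
            sanitize(child)
            prev = child
        else:
            # Drop the whole subtree, but keep the text that followed it
            tail = child.tail or ""
            if prev is None:
                elem.text = (elem.text or "") + tail
            else:
                prev.tail = (prev.tail or "") + tail
            elem.remove(child)


doc = ET.fromstring(
    '<div onclick="x()"><p>Hi <script>alert(1)</script>there '
    '<a href="/wiki/Foo" onmouseover="y()">Foo</a></p></div>'
)
sanitize(doc)
print(ET.tostring(doc, encoding="unicode"))
```

The point of the sketch is only that the checks run once on the submitted
tree, before storage, instead of on every conversion back out.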
Metadata embedded in the DOM beyond regular HTML can be divided into two
categories: Public RDFa-based structures and private round-trip data.
Public RDFa structures will need more solid verification, but are
otherwise pretty straightforward (see the spec at [1]). We plan to move
private round-trip data out of the DOM, which would prevent clients from
messing with it. We will probably use some unique id attributes to aid
the association of nodes with their metadata, but might also be able to
get away without such ids by using a subtree hashing scheme similar to the
one described in XyDiff [2]. XyDiff would also be an improvement over the
simplistic DOM diff algorithm we currently use for change detection.
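
For illustration, the subtree hashing idea could be sketched as below. This
is not the XyDiff algorithm itself, only a simple bottom-up (Merkle-style)
hash in the same spirit; the function name and the choice of SHA-1 are
assumptions made for the example.

```python
import hashlib
import xml.etree.ElementTree as ET


def index_subtrees(root):
    """Map each subtree digest to the first element that has that subtree."""
    index = {}

    def visit(elem):
        h = hashlib.sha1()
        h.update(elem.tag.encode("utf-8"))
        # Sort attributes so their order does not affect the digest
        for name, value in sorted(elem.attrib.items()):
            h.update(f"\0{name}={value}".encode("utf-8"))
        h.update(b"\0text:" + (elem.text or "").encode("utf-8"))
        # Fold in each child's digest, so a change anywhere below
        # changes the digest of every ancestor
        for child in elem:
            h.update(visit(child).encode("ascii"))
            h.update(b"\0tail:" + (child.tail or "").encode("utf-8"))
        digest = h.hexdigest()
        index.setdefault(digest, elem)
        return digest

    visit(root)
    return index


old = ET.fromstring("<div><p>unchanged</p><p>old text</p></div>")
new = ET.fromstring("<div><p>unchanged</p><p>new text</p></div>")
common = set(index_subtrees(old)) & set(index_subtrees(new))
print(len(common))  # only the untouched <p> subtree is shared
```

Private round-trip data stored under such a digest could be re-associated
with any subtree the client left untouched. Identical subtrees (two equal
paragraphs, say) produce the same digest, which is one reason ids may still
be needed.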
Gabriel
[1]: http://www.mediawiki.org/wiki/Parsoid/MediaWiki_DOM_spec
[2]: http://gregory.cobena.free.fr/www/Publications/%5BICDE2002%5D%20XyDiff%20-%…
--
Gabriel Wicke
Senior Software Engineer
Wikimedia Foundation