+1
I think everything through Q3 looks like a good way to proceed. There might be an interesting division of labor on getting these things done (Parsoid job handling, the Cite extension rewrite, API batching), and I'd be willing to help in areas where I'd be useful. This is ambitious, but the steps laid out look manageable by themselves. We will see how the target dates collide with reality, which may also depend on the level of interest.
I'd really like to see a reduction of CPU spent on refreshLinks jobs, so anything that helps in that area is welcome. We currently rely on throwing more processes and hardware at the problem and on de-duplication to at least stop jobs from piling up (such as when heavily used templates keep getting edited before the previous jobs finish). De-duplication has its own costs, and it will make sense to move the queue off the main clusters. Managing these jobs is getting more difficult. In fact, edits to a handful of templates can account for a majority of the queue, with tens of thousands of entire pages being re-parsed because of some modest template change. I like the idea of storing dependency information in (or alongside) the HTML as metadata and using it to recompute only the affected parts of the DOM.
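To make that concrete, here is a minimal sketch (not actual MediaWiki/Parsoid code; all names are hypothetical) of per-fragment dependency metadata stored alongside rendered HTML, plus signature-based de-duplication of the resulting refresh jobs:

    import hashlib
    from dataclasses import dataclass, field

    @dataclass
    class Fragment:
        """A piece of rendered HTML plus the templates/files it was expanded from."""
        html: str
        deps: set

    @dataclass
    class RenderedPage:
        fragments: list = field(default_factory=list)

        def affected_by(self, changed_title):
            """Indexes of fragments whose output may change when changed_title changes."""
            return [i for i, f in enumerate(self.fragments) if changed_title in f.deps]

    def refresh(page, changed_title, expand):
        """Re-expand only the affected fragments instead of re-parsing the whole page."""
        for i in page.affected_by(changed_title):
            frag = page.fragments[i]
            frag.html = expand(frag.html, frag.deps)
        return page

    # De-duplication: jobs for the same (page, changed template) pair collapse to one.
    pending = {}

    def enqueue(page_id, changed_title):
        sig = hashlib.sha1(f"{page_id}:{changed_title}".encode()).hexdigest()
        if sig not in pending:          # duplicate jobs are dropped, not re-queued
            pending[sig] = (page_id, changed_title)

The point is that the template edit only forces re-expansion of the fragments that list it as a dependency; everything else in the stored DOM is left alone.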
There is certainly discussion to be had about the cleanest way to handle the trade-offs of when to store updated HTML for a revision (when a template or file changes, or when a magic word or DPL list should be re-calculated). It probably will not make sense for old revisions of pages. If we are storing new versions of HTML, it may make sense to purge the old ones from external storage if updates are frequent, though that interface has no deletion support and deletion runs slightly against the philosophy of the external storage classes. It's probably not a big deal to change, though. I've also been told that the HTML tends to compress well, so we should not be looking at an order-of-magnitude increase in text storage requirements (maybe 4X or so, going by some quick tests). I'd like to see documented statistics on this though, with samples.
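Gathering those numbers should be cheap. Something like the following would do for a first pass (a sketch only; the sample files and compression level are placeholders, not a claim about what external storage actually uses):

    import statistics
    import sys
    import zlib

    def ratio(path, level=6):
        """Compressed/uncompressed size for one rendered-HTML sample."""
        with open(path, "rb") as f:
            data = f.read()
        return len(zlib.compress(data, level)) / len(data)

    if __name__ == "__main__":
        # Usage: python compress_stats.py sample1.html sample2.html ...
        ratios = [ratio(p) for p in sys.argv[1:]]
        print(f"samples={len(ratios)} "
              f"mean={statistics.mean(ratios):.2f} "
              f"median={statistics.median(ratios):.2f} "
              f"worst={max(ratios):.2f}")

Running that over a reasonable spread of pages (stubs, long articles, table-heavy pages) would tell us whether the ~4X figure holds up.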
I think the VisualEditor + HTML-only method for third parties is interesting and could probably make good use of ContentHandler. I'm curious about the exact nature of the HTML validation needed server-side for this setup, but from what I understand it would not be too complicated, and the metadata could be handled in a way that does not require blind trust of the client.
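For the validation piece, I would expect something along the lines of a whitelist pass over the submitted markup (again a sketch; the allowed tag/attribute sets here are illustrative, not whatever Parsoid/MediaWiki would actually enforce):

    from html.parser import HTMLParser

    ALLOWED_TAGS = {"p", "a", "b", "i", "ul", "ol", "li", "span", "table", "tr", "td"}
    ALLOWED_ATTRS = {"href", "class", "about", "typeof", "data-mw"}

    class Validator(HTMLParser):
        """Collects violations instead of trusting client-supplied HTML blindly."""
        def __init__(self):
            super().__init__()
            self.errors = []

        def handle_starttag(self, tag, attrs):
            if tag not in ALLOWED_TAGS:
                self.errors.append(f"disallowed tag <{tag}>")
            for name, _ in attrs:
                if name not in ALLOWED_ATTRS:
                    self.errors.append(f"disallowed attribute {name!r} on <{tag}>")

    def validate(html):
        v = Validator()
        v.feed(html)
        return v.errors

    # validate('<p onclick="x()">hi</p>') -> ["disallowed attribute 'onclick' on <p>"]

Anything the server cannot verify (or re-derive itself) just gets rejected or regenerated, which is what keeps the client-supplied metadata from needing to be trusted.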