On wikipedia-l, Brion wrote:
Please confirm that you have talked with Tim, who is already working on version tagging keeping our caching requirements in mind.
I thought I'd better describe here what I'm doing in that area, so that I can get some comments on the design, and in case any of it affects what other people are doing.
It started innocently enough: I ported Salvatore's "show verified" patch [1] to HEAD, intending to put it live within a day or so. But the problem with that patch, as with other features that encourage viewing of revisions other than the latest one, is that such page views are currently uncached. This is especially a problem for Salvatore's patch, which proposes to switch particularly popular articles and frequently vandalised articles to a moderated mode, where an old revision is displayed by default.
Our goals are:
* Consistency between the parser cache and the rerendered HTML at any time
* Fast cache hits
* Low-latency cache updates

At the moment, rerendered HTML changes in the following circumstances:
* Template change
* Link colour change
* Deletion of the revision
One simple way to reduce the rate at which the cache for old revisions is invalidated would be to retrieve the old revisions of all included templates when the revision is rerendered. In other words, to display all templates as they were when the article was edited. Besides the cache implications, this is also a highly desired UI feature. It turns out to be fairly easy to implement, with the following caveats:
* If a template is moved, there is no way to reliably determine where it moved to. We could follow the redirect in the first revision, but the redirect might have been changed by deletion and recreation by an admin.
* Template deletion necessarily changes the rerendered HTML.
My original idea was to ignore these changes, and any link colour changes, in the interests of simplicity and cache efficiency. However, Brion expressed a desire to see at least some part of MediaWiki work perfectly, and I admit that's a good point.
For caches of old revisions to properly reflect the changes described above, there are two design options that I can see:
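The idea of displaying templates as they were when the article was edited amounts to an "as of" lookup in each template's history. A minimal sketch, assuming each history is available as a list of (timestamp, revision id) pairs sorted by timestamp; the function name and the data layout are hypothetical, not MediaWiki code:

```python
from bisect import bisect_right

def revision_as_of(history, timestamp):
    """Return the id of the revision that was current at `timestamp`.

    `history` is a list of (timestamp, rev_id) tuples sorted by timestamp
    (an assumed layout for illustration). Returns None if the page did not
    exist yet at that time.
    """
    timestamps = [ts for ts, _ in history]
    # Index of the first revision strictly newer than `timestamp`.
    i = bisect_right(timestamps, timestamp)
    return history[i - 1][1] if i > 0 else None

# Hypothetical template history: (edit timestamp, revision id).
template_history = [(100, "t1"), (200, "t2"), (300, "t3")]
```

Rendering an article revision saved at time 250 would then pick up template revision "t2". Note that this sketch does not address the move and deletion caveats listed above.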
1) Store a list of templates and links in the parser cache object, and check them all to make sure they still exist, on every parser cache hit. This would make parser cache hits slow, although they would be somewhat faster than rerendering.
2) Store a list of templates and links in the database, indexed both ways, in a similar way to what we do now with current revisions. Then, analogously to the behaviour for current revisions, all revisions which include a template would have their rev_touched field updated when the template is deleted or moved.
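To make the cost of option 1 concrete, here is a rough sketch of validating a cache object on every hit. It assumes the cache object records, for each template or link target, whether that page existed at render time; the dict-based cache, the `page_exists` callback and both function names are hypothetical stand-ins:

```python
def render_and_cache(cache, rev_id, html, deps, page_exists):
    """Store rendered HTML together with the existence state of each dependency."""
    cache[rev_id] = {
        "html": html,
        "deps": {title: page_exists(title) for title in deps},
    }

def get_cached(cache, rev_id, page_exists):
    """Return cached HTML, or None if it is missing or no longer valid."""
    entry = cache.get(rev_id)
    if entry is None:
        return None
    # One existence check per dependency: this is the per-hit cost of option 1.
    for title, existed in entry["deps"].items():
        if page_exists(title) != existed:
            del cache[rev_id]  # stale: a template or link target changed state
            return None
    return entry["html"]
```

Every hit pays for as many existence checks as the revision has templates and links, which is why hits would be slow compared with a plain cache lookup.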
Option 2 is the one I'm running with, because it's likely to be faster if carefully implemented. Note that template links from old revisions are only required when there is a valid cache object stored which might need to be invalidated. Thus we can reduce the size of the table by only registering links to tagged or current revisions.
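Option 2 can be sketched in memory as a link table indexed both ways, with a per-revision touched timestamp. This is only an illustration under stated assumptions: the class, its method names and the integer timestamps are hypothetical, and the real design would use database tables rather than Python dicts.

```python
from collections import defaultdict

class TemplateLinks:
    """Two-way index between revisions and the templates they include."""

    def __init__(self):
        self.by_revision = defaultdict(set)  # rev_id -> template titles
        self.by_template = defaultdict(set)  # template title -> rev_ids
        self.rev_touched = {}                # rev_id -> last invalidation time

    def register(self, rev_id, templates, now):
        """Record links for a tagged or current revision."""
        for title in templates:
            self.by_revision[rev_id].add(title)
            self.by_template[title].add(rev_id)
        self.rev_touched.setdefault(rev_id, now)

    def unregister(self, rev_id):
        """Drop links for a revision that is no longer tagged or current."""
        for title in self.by_revision.pop(rev_id, set()):
            self.by_template[title].discard(rev_id)

    def template_changed(self, title, now):
        """Touch every registered revision that includes `title`."""
        touched = sorted(self.by_template.get(title, ()))
        for rev_id in touched:
            self.rev_touched[rev_id] = now
        return touched
```

A cache hit then only needs to compare the cache object's timestamp with the revision's touched time; the work of finding affected revisions moves to the comparatively rare template delete or move.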
This is about as far as my planning goes; I'm not quite sure of the details of this template tracking system, so in the following section I'm just thinking aloud.
* Should the new templatelinks table be indexed by (page,tag) or revision? If it's indexed by (page,tag), then we need to update all those rows when a tag changes. If it's indexed by revision, then we need to (periodically?) purge the table of all rows associated with untagged revisions.
* Just how bad would it be if we indexed by revision and let the templatelinks table grow indefinitely? What would give out first?
* Parser cache objects have a finite lifetime in memcached. Perhaps we could add a cache timestamp to templatelinks, and then periodically delete all templatelinks rows for which the parser cache object has expired. This has the advantage of allowing us to cache untagged old revisions, but I'm not sure what the DB performance implications would be.
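The cache-timestamp idea in the last point could look roughly like the following. The row layout (a dict with a `cache_time` key), the function name and the fixed `cache_lifetime` are assumptions made for illustration; in practice this would be a periodic DELETE against the templatelinks table.

```python
def purge_expired(rows, now, cache_lifetime):
    """Drop link rows whose parser cache object must already have expired.

    `rows` is a list of dicts with at least a "cache_time" key (assumed
    layout). Returns (kept_rows, purged_count).
    """
    kept = [row for row in rows if now - row["cache_time"] < cache_lifetime]
    return kept, len(rows) - len(kept)

# Hypothetical rows: one old cache object and two recent ones.
rows = [
    {"rev_id": 10, "template": "Template:Foo", "cache_time": 0},
    {"rev_id": 11, "template": "Template:Foo", "cache_time": 50},
    {"rev_id": 12, "template": "Template:Bar", "cache_time": 90},
]
kept, purged = purge_expired(rows, now=100, cache_lifetime=60)
```

With these values only the row for revision 10 is purged, since its cache object is older than the assumed lifetime.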
So far, I've ported Salvatore's patch to HEAD, and refactored the link handling code, to track templates properly and to make it more flexible. None of it is committed yet.
-- Tim Starling
[1] http://mail.wikimedia.org/pipermail/wikitech-l/2005-July/030898.html
wikitech-l@lists.wikimedia.org