On wikipedia-l, Brion wrote:
Please confirm that you have talked with Tim, who is already working on
version tagging keeping our caching requirements in mind.
I thought I'd better describe here what I'm doing in that area, so that I
can get some comments on the design, and in case any of it affects what
other people are doing.
It started innocently enough: I ported Salvatore's "show verified" patch [1]
to HEAD, with the intention of putting it live within a day or so. But the
problem with that patch, like other features that encourage viewing of
revisions other than the latest one, is that such page views are currently
uncached. This is especially a problem for Salvatore's patch, which proposes
to switch particularly popular articles and frequently vandalised articles
to a moderated mode, where the old revision is displayed by default.
Our goals are:
* Consistency between the parser cache and the rerendered HTML at any time
* Fast cache hits
* Low-latency cache updates
At the moment, rerendered HTML changes in the following circumstances:
* Template change
* Link colour change
* Deletion of the revision
One simple way to reduce the rate at which the cache for old revisions is
invalidated would be to retrieve the old revisions of all included templates
when the revision is rerendered. In other words, to display all templates as
they were when the article was edited. Besides the cache implications, this
is also a highly desired UI feature. It turns out to be fairly easy to
implement, with the following caveats:
* If a template is moved, there is no way to reliably determine where it
moved to. We could follow the redirect in the first revision, but the
redirect might have been changed by deletion and recreation by an admin.
* Template deletion necessarily changes the rerendered HTML.
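A minimal sketch of the "templates as they were when the article was edited" lookup, in Python for illustration only. The in-memory TEMPLATE_HISTORY structure and the function name are hypothetical; a real implementation would query the revision table. It also shows why deletion is a caveat: with no history left, there is nothing to return.

```python
from bisect import bisect_right

# Hypothetical stand-in for the revision table:
# template title -> list of (timestamp, wikitext), sorted by timestamp.
TEMPLATE_HISTORY = {
    "Template:Stub": [
        (100, "This article is a stub."),
        (200, "This article is a stub. Please expand it."),
    ],
}


def template_text_as_of(title, timestamp, history=TEMPLATE_HISTORY):
    """Return the template text that was current at `timestamp`, or None."""
    revisions = history.get(title)
    if not revisions:
        # Template was deleted (or never existed): the old text is gone,
        # so the rerendered HTML necessarily changes.
        return None
    timestamps = [ts for ts, _ in revisions]
    # Index of the first template revision strictly newer than `timestamp`.
    idx = bisect_right(timestamps, timestamp)
    if idx == 0:
        return None  # the template did not exist yet at that time
    return revisions[idx - 1][1]
```

Rendering an article revision saved at time 150 would then call template_text_as_of("Template:Stub", 150) and get the first version of the template, regardless of later edits to it. Moved templates are not handled here, for the reason given in the first caveat.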
My original idea was to ignore these changes, and any link colour changes,
in the interests of simplicity and cache efficiency. However, Brion expressed
a desire to see at least some part of MediaWiki work perfectly, and I admit
that's a good point.
For caches of old revisions to properly reflect the changes described
above, there are two design options that I can see:
1) Store a list of templates and links in the parser cache object, and check
them all to make sure they still exist, on every parser cache hit. This
would make parser cache hits slow, although they would be somewhat faster
than rerendering.
2) Store a list of templates and links in the database, indexed both ways,
in a similar way to what we do now with current revisions. Then analogously
to the behaviour for current revisions, all revisions which include a
template would have their rev_touched field updated when the template is
deleted or moved.
Option 2 is the one I'm running with, because it's likely to be faster if
carefully implemented. Note that template links from old revisions are only
required when there is a valid cache object stored which might need to be
invalidated. Thus we can reduce the size of the table by only registering
links to tagged or current revisions.
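To make option 2 concrete, here is a rough in-memory sketch, in Python for illustration only. The class and method names are hypothetical, the two dictionaries stand in for a database table indexed both ways, and rev_touched is modelled as a plain timestamp per revision. It is not the actual schema, which is still undecided.

```python
from collections import defaultdict


class TemplateLinks:
    """Hypothetical two-way index of template usage by revision."""

    def __init__(self):
        self.by_revision = defaultdict(set)  # rev_id -> template titles
        self.by_template = defaultdict(set)  # template title -> rev_ids
        self.rev_touched = {}                # rev_id -> last invalidation time

    def register(self, rev_id, templates, now):
        """Record the templates used by a tagged or current revision."""
        for title in templates:
            self.by_revision[rev_id].add(title)
            self.by_template[title].add(rev_id)
        self.rev_touched.setdefault(rev_id, now)

    def template_changed(self, title, now):
        """Call when a template is deleted or moved; touch every user."""
        touched = set(self.by_template.get(title, ()))
        for rev_id in touched:
            self.rev_touched[rev_id] = now
        return touched

    def cache_is_valid(self, rev_id, cached_at):
        """A cached rendering is usable only if no touch happened after it."""
        if rev_id not in self.rev_touched:
            return False  # unregistered revisions are never served from cache
        return cached_at >= self.rev_touched[rev_id]
```

The point of the design is that a cache hit costs one timestamp comparison, while the expensive fan-out happens only when a template is actually deleted or moved. Option 1 would instead walk the stored list of templates and links on every hit.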
This is about as far as my planning goes; I'm not quite sure of the details
of this template tracking system. So in the following section I'm just
thinking aloud.
* Should the new templatelinks table be indexed by (page,tag) or revision?
If it's indexed by (page,tag), then we need to update all those rows when a
tag changes. If it's indexed by revision, then we need to (periodically?)
purge the table of all rows associated with untagged revisions.
* Just how bad would it be if we indexed by revision and let the
templatelinks table grow indefinitely? What would give out first?
* Parser cache objects have a finite lifetime in memcached. Perhaps we could
add a cache timestamp to templatelinks, and then periodically delete all
templatelinks rows for which the parser cache object has expired. This has
the advantage of allowing us to cache untagged old revisions, but I'm not
sure what the DB performance implications would be.
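The last idea, purging by cache timestamp, could look roughly like the following. This is a sketch in Python under stated assumptions: each row carries a hypothetical cached_at field, and cache_ttl is the lifetime of parser cache objects in memcached. The real thing would be a periodic DELETE against the table, whose cost is exactly the open question.

```python
def purge_expired(rows, now, cache_ttl):
    """Keep only rows whose parser cache object could still be alive.

    `rows` is a list of dicts with hypothetical keys
    'rev_id', 'template' and 'cached_at' (all timestamps in seconds).
    """
    return [row for row in rows if row["cached_at"] + cache_ttl > now]


rows = [
    {"rev_id": 1, "template": "Template:Stub", "cached_at": 1000},
    {"rev_id": 2, "template": "Template:Stub", "cached_at": 5000},
]
# With a one-hour lifetime, only the row cached at t=5000 survives at t=6000.
remaining = purge_expired(rows, now=6000, cache_ttl=3600)
```

Because a row is kept for as long as its cache object might exist, untagged old revisions can be cached too, and the table size is bounded by the cache lifetime rather than by the number of revisions.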
So far, I've ported Salvatore's patch to HEAD, and refactored the link
handling code, to track templates properly and to make it more flexible.
None of it is committed yet.
-- Tim Starling
[1] http://mail.wikimedia.org/pipermail/wikitech-l/2005-July/030898.html