After a very informative exchange with Tim Starling, I've thought a bit more about my proposal from last night about making Wikipedia cacheable, which in the light of day seems excessively complex. Here's a simpler version:
At the moment, according to Tim, all Wikipedia pages are served as uncacheable pages, thus preventing any intermediate proxy caches from caching them -- a quick packet dump shows that they are served with
Cache-Control: private, s-maxage=0, max-age=0, must-revalidate
whether or not I am logged in. Clearly, if Wikipedia content were cacheable, there would be massive bandwidth gains, but the current policy is designed to prevent caches from serving out-of-date content, or from serving the same page to anons and logged-in users.
I'd be interested in the effects if we were to serve most ordinary pages with
Cache-Control: public, must-revalidate
with (say) a max-age of a week, with the following three exceptions, which need different or rapidly-changing data (I'll call it "dynamic content") served to different users for the same URL:
(a) pages for logged-in users
(b) pages for anon users who have a pending message
(c) pages with auto-generated dynamic content (Special: pages, and any others with similar behaviour)
which would be served with the anti-caching cache control header as before.
Since all pages would be must-revalidate, the Wikimedia cluster would still get a conditional GET request per hit, so that it could check freshness, then decide which header to generate, based on source IP and any user cookies. The twist would be that the page would be reported as outdated by the server if _either_ it had been changed since the cache stored it, _or_ dynamic content was needed, thus serving the desired dynamic content to those users who need it, whilst preventing that content from being cached for other users.
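To make the twist concrete, a conditional GET could go one of three ways (an illustration; the exact headers are sketched, not final):

1. Anon without a pending message, page unchanged since the cache stored it:
   the server answers "304 Not Modified", and the proxy serves its stored copy.

2. Anon without a pending message, page edited since then:
   the server answers "200 OK" with
   Cache-Control: public, must-revalidate
   and the proxy stores and serves the fresh copy.

3. Logged-in user, anon with a pending message, or a Special: page:
   the server answers "200 OK" with
   Cache-Control: private, s-maxage=0, max-age=0, must-revalidate
   so the response reaches that user alone and is not stored for others.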
Since 95%+ of all hits are presumably from anons without pending messages, this should, in an ideal world, result in a very large number of pages being successfully served by hits on ISPs' proxy caches, without stopping dynamic content from being served to those users who need it, or affecting the freshness of pages for anons.
The hit rate would not be quite as high as it could be, since every hit from a dynamic-content user would "wash out" any static version of the page in question from the cache; but since these users would account for only about 1 in 20 page accesses, the remaining 19 out of 20 times there would still be a hit.
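As a back-of-envelope check, here is that arithmetic as a few lines of Python (a sketch, under the simplifying assumptions that page requests are independent and that every dynamic hit evicts the cached copy):

p_dynamic = 1 / 20            # hits needing dynamic content
p_static = 1 - p_dynamic      # anon hits with no pending message

# A static request finds a usable cached copy only if the previous
# request for that page was also static, i.e. was not washed out:
static_hit_rate = 1 - p_dynamic
overall_hit_rate = p_static * static_hit_rate
print(overall_hit_rate)       # about 0.90: roughly 9 in 10 page
                              # bodies come straight from the cache

Note that with must-revalidate every hit still costs the cluster one conditional GET; the saving is in the page bodies, which are the bulk of the bandwidth.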
I'd be interested to hear what others think. Is there an obvious flaw in my reasoning? Is this worth a try?
-- Neil
-------------------------------------------------------------
Pseudo-code:
if (logged_in_user) or (user_has_messages) or (special_page):
    say has changed; serve with Cache-Control: private, must-revalidate
else:
    if (modification_date > if_modified_since_date):
        say has changed; serve with Cache-Control: public, must-revalidate
    else:
        say has not changed
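For concreteness, here is the same decision as runnable Python (a sketch only: the request/page attributes and the full_page()/not_modified() helpers are hypothetical stand-ins, not MediaWiki's actual interfaces):

from email.utils import parsedate_to_datetime

def respond(request, page):
    # Dynamic content: always claim the page has changed, and mark the
    # response private so shared caches will not store it for others.
    if request.logged_in_user or request.user_has_messages or page.is_special:
        return full_page(page, cache_control="private, must-revalidate")

    # No If-Modified-Since header means the client has no copy at all,
    # so we must send the full page in that case too.
    ims = request.headers.get("If-Modified-Since")
    if ims is not None and page.modification_date <= parsedate_to_datetime(ims):
        # The cache's stored copy is still fresh; a 304 lets it be reused.
        return not_modified()

    # Changed (or never cached): send the new copy and let caches keep it.
    return full_page(page, cache_control="public, must-revalidate")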
Neil Harris wrote:
After a very informative exchange with Tim Starling, I've thought a bit more about my proposal from last night about making Wikipedia cacheable, which in the light of day seems excessively complex. Here's a simpler version:
[snip]
(a) pages for logged-in users
(b) pages for anon users who have a pending message
(c) pages with auto-generated dynamic content (Special: pages, and any others with similar behaviour)
[snip]
I'd be interested to hear what others think. Is there an obvious flaw in my reasoning? Is this worth a try?
I agree that this would be better, but implementing it has one problem: Squid 2.5 currently does not differentiate Cache-Control headers between its role as an accelerator/surrogate and its role as a normal proxy cache. For this to work, Mediawiki would need to send *two* kinds of cache control headers: one for our squids, and one for "the others out there".
Squid 3 (if it'll ever see the light of day) will support this through an X-Surrogate-Control header, but until then we're stuck with a hack: Mediawiki sends out Cache-Control headers for our squids, which then try to recognize normal content pages and replace the Cache-Control header on those responses with:
Cache-Control: private, s-maxage=0, max-age=0, must-revalidate
We could of course do s/private/public/ here, but there may be pages that are private/user-specific which Squid cannot reliably detect, simply because it doesn't have the same detailed information that Mediawiki has. Therefore, 'private' is the safest way to go for now.
Another option would be to try to signal Squid when it should and when it should not replace the Cache-Control header sent by Mediawiki. However, Squid is not very flexible in this area, so it'd be a dirty hack.
Modifying Squid is of course possible as well, but requires a bit more effort. Additionally, it has proven to be a nuisance to maintain a patched-up Squid on our cluster, so we'd like to keep the set of patches against upstream as small as possible.
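For reference, once that X-Surrogate-Control mechanism materialises, the two kinds of headers could look roughly like this (an illustration only -- the directive syntax is not settled):

Cache-Control: public, must-revalidate
X-Surrogate-Control: max-age=0

Our squids would obey the second header (revalidating on every hit, as they do now) and strip it before forwarding, so the caches "out there" would see only the cacheable Cache-Control line.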