After a very informative exchange with Tim Starling, I've thought a bit more about last night's proposal for making Wikipedia cacheable, which in the light of day seems excessively complex. Here's a simpler version:
At the moment, according to Tim, all Wikipedia pages are served as uncacheable, preventing any intermediate proxy caches from caching them -- a quick packet dump shows that they are served with
Cache-Control: private, s-maxage=0, max-age=0, must-revalidate
whether or not I am logged in. Clearly, if Wikipedia content were cacheable, there would be massive bandwidth gains, but the current policy is designed to prevent caches from serving out-of-date content, or serving the same page to anons and logged-in users.
I'd be interested in the effects if we were to serve most ordinary pages with
Cache-Control: public, must-revalidate
with (say) a max-age of a week (a concrete example follows the list below), with the following three exceptions, which need different or rapidly-changing data (I'll call it "dynamic content") served to different users for the same URL:
(a) pages for logged-in users
(b) pages for anon users who have a pending message
(c) pages with auto-generated dynamic content (Special: pages, and any others with similar behaviour)
which would be served with the anti-caching cache control header as before.
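For the normal case, that would mean a header along the lines of

Cache-Control: public, must-revalidate, max-age=604800

(604800 seconds being one week), while the three exception cases above would keep the existing private / max-age=0 header.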
Since all pages would be must-revalidate, the Wikimedia cluster would still get a conditional GET request per hit, so that it could check freshness, then decide which header to generate, based on source IP and any user cookies. The twist would be that the page would be reported as outdated by the server if _either_ it had been changed since the cache stored it, _or_ dynamic content was needed, thus serving the desired dynamic content to those users who need it, whilst preventing that content from being cached for other users.
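To make the exchange concrete (with a made-up article URL and date), an ISP cache revalidating its stored copy would send something like

GET /wiki/Example_article HTTP/1.1
Host: en.wikipedia.org
If-Modified-Since: Thu, 17 Apr 2003 10:00:00 GMT

and the cluster would reply 304 Not Modified to an ordinary anon if the page is unchanged, but 200 OK with the full page and the private header if the requester is logged in or has a pending message, even when the article text itself hasn't changed.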
Since 95%+ of all hits are presumably from anons without pending messages, this should, in an ideal world, result in a very large number of pages being successfully served by hits on ISPs' proxy caches, without stopping dynamic content from being served to those users who need it, or affecting the freshness of pages for anons.
The hit rate would not be quite as high as possible, since every hit from a dynamic-content user would "wash out" any static version of the page in question from the cache, but since these users would account for only about 1 in 20 page accesses, the remaining 19 times out of 20 there would still be a hit.
I'd be interested to hear what others think. Is there an obvious flaw in my reasoning? Is this worth a try?
-- Neil
-------------------------------------------------------------
Pseudo-code:
if (logged_in_user) or (user_has_messages) or (special_page):
    say has changed; serve with Cache-Control: private, must-revalidate
else:
    if (modification_date > if_modified_since_date):
        say has changed; serve with Cache-Control: public, must-revalidate
    else:
        say has not changed
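For concreteness, here's a rough Python sketch of the same decision. The names (logged_in_user, user_has_messages, special_page, modification_date, if_modified_since) are placeholders, not actual MediaWiki internals, and error handling is omitted:

# Illustrative sketch only -- placeholder names, not MediaWiki code.

ONE_WEEK = 7 * 24 * 3600   # proposed max-age, in seconds

def choose_response(logged_in_user, user_has_messages, special_page,
                    modification_date, if_modified_since):
    """Return (status_code, cache_control) for a page request.

    modification_date and if_modified_since are datetime objects;
    if_modified_since is None when the request is unconditional.
    """
    if logged_in_user or user_has_messages or special_page:
        # Dynamic content: always report the page as changed, and
        # mark the response so shared caches won't store it.
        return 200, "private, must-revalidate"

    cacheable = "public, must-revalidate, max-age=%d" % ONE_WEEK
    if if_modified_since is None or modification_date > if_modified_since:
        # Genuinely changed (or no validator sent): serve a fresh copy
        # that ISP caches may keep for up to a week.
        return 200, cacheable
    # Unchanged, ordinary anon: tell the cache its copy is still good.
    return 304, cacheable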