Neil Harris wrote:
After a very informative exchange with Tim Starling, I've thought a bit more about last night's proposal for making Wikipedia cacheable, which in the light of day seems excessively complex. Here's a simpler version:
[snip]
(a) pages for logged-in users
(b) pages for anon users who have a pending message
(c) pages with auto-generated dynamic content (Special: pages, and any others with similar behaviour)
[snip]
I'd be interested to hear what others think. Is there an obvious flaw in my reasoning? Is this worth a try?
I agree that this would be better, but implementing it has one problem: Squid 2.5 currently does not differentiate Cache-Control headers between its role as an accelerator/surrogate and its role as a normal proxy cache. For this to work, MediaWiki would have to be able to send *two* kinds of cache control headers: one for our squids, and one for "the others out there".
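To make the conflict concrete (the values here are illustrative, not our production settings): for an ordinary article view we would want to send something like

Cache-Control: public, s-maxage=2678400, max-age=0, must-revalidate

so that our own squids could hold the page for a long time while browsers revalidate on every view. But s-maxage applies to *every* shared cache, not just ours, so any downstream proxy would happily serve a month-old copy as well.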
Squid 3 (if it'll ever see the light of day) will support this through an X-Surrogate-Control header, but until then we're stuck with a hack: MediaWiki sends out Cache-Control headers meant for our squids, which then try to recognize normal content pages and replace the Cache-Control header on those responses with:
Cache-Control: private, s-maxage=0, max-age=0, must-revalidate
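For reference, what those directives tell the caches further downstream:

private          - shared caches may not store the response at all
s-maxage=0       - a shared cache holding a copy anyway must consider it stale immediately
max-age=0        - browser caches must likewise consider it stale immediately
must-revalidate  - a stale copy may never be served without successful revalidation against the origin

The combination is deliberately belt-and-braces: on a well-behaved cache any one of the first three would mostly do the job.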
We could of course do s/private/public/ here, but there may be pages that are private/user-specific which Squid cannot reliably detect, simply because it doesn't have the same detailed information that MediaWiki has: a page rendered for an anon user with a pending "you have new messages" notice, for example, looks like any other article URL from the outside. Therefore, 'private' is the safest way to go for now.
Another option would be to try to signal to Squid when it should and when it should not replace the Cache-Control header sent by MediaWiki. However, Squid is not very flexible in this area, so it'd be a dirty hack.
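For the record, the kind of signalling I mean would look something like this (the header name is entirely hypothetical):

X-Squid-Rewrite-CC: yes

i.e. MediaWiki marks exactly those responses whose Cache-Control may be rewritten, and our squids rewrite only those, stripping the marker before the response leaves the cluster. The trouble is that stock Squid 2.5 offers no clean configuration hook for conditionally rewriting *response* headers, which is why this ends up as a dirty hack.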
Modifying Squid is of course possible as well, but requires a bit more effort. Additionally, it has proven to be a nuisance to maintain a patched-up Squid on our cluster, so we'd like to keep the set of patches against upstream as small as possible.