I took a quick peek at the sampled squid log and found that CSS and JS files are eating a lot of bandwidth; together they make up about 20% of what's served:
https://wikitech.leuksman.com/view/Squid_bandwidth_breakdown
(May be inaccurate due to coding mistakes in my counter or weird dupe caching effects.)
Please forgive me if this is a dumb question, but if I check the headers returned for two successive requests, like so:
======================================================
root@bling:/var/www/hosts/mediawiki/wiki# curl --silent --include --head http://en.wikipedia.org/skins-1.5/monobook/main.css?55
HTTP/1.0 200 OK
Date: Wed, 07 Feb 2007 02:56:31 GMT
Server: Apache
Cache-Control: max-age=2592000
Expires: Fri, 09 Mar 2007 02:56:31 GMT
Last-Modified: Tue, 06 Feb 2007 20:04:40 GMT
ETag: "60874b-709d-45c8df58"
Accept-Ranges: bytes
Content-Length: 28829
Content-Type: text/css
Age: 2
X-Cache: HIT from sq30.wikimedia.org
X-Cache-Lookup: HIT from sq30.wikimedia.org:80
Via: 1.0 sq30.wikimedia.org:80 (squid/2.6.STABLE9)
Connection: close

root@bling:/var/www/hosts/mediawiki/wiki# curl --silent --include --head http://en.wikipedia.org/skins-1.5/monobook/main.css?55
HTTP/1.0 200 OK
Date: Wed, 07 Feb 2007 02:56:26 GMT
Server: Apache
Cache-Control: max-age=2592000
Expires: Fri, 09 Mar 2007 02:56:26 GMT
Last-Modified: Tue, 06 Feb 2007 20:04:40 GMT
ETag: "15c02de-709d-45c8df58"
Accept-Ranges: bytes
Content-Length: 28829
Content-Type: text/css
Age: 9
X-Cache: HIT from sq20.wikimedia.org
X-Cache-Lookup: HIT from sq20.wikimedia.org:80
Via: 1.0 sq20.wikimedia.org:80 (squid/2.6.STABLE9)
Connection: close

root@bling:/var/www/hosts/mediawiki/wiki#
======================================================
... then I have two questions:
1) Does it matter that the ETag varies between successive requests? The reason I ask is that http://www.web-caching.com/mnot_tutorial/how.html says: "HTTP 1.1 introduced a new kind of validator called the ETag. ETags are unique identifiers that are generated by the server and changed every time the object does. Because the server controls how the ETag is generated, caches can be surer that if the ETag matches when they make a If-None-Match request, the object really is the same."
I.e. if the ETag changes between requests, as it did in the example above, could that lead requesters to think the object has changed too, and so reduce cache effectiveness?
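For what it's worth, the two ETags in the transcript above ("60874b-709d-45c8df58" vs "15c02de-709d-45c8df58") differ only in their first component, which is what Apache's default FileETag setting (INode MTime Size) produces when the same file lives on different backend servers: the size and mtime parts agree, but the inode differs per machine. Assuming the backends are stock Apache, a tweak along these lines might make the ETags consistent across servers (just a sketch, not something I've tested on the cluster):

```apache
# Drop the inode component from generated ETags so that identical
# files on different backend servers produce identical validators.
# (Assumes Apache httpd with the default "FileETag INode MTime Size".)
FileETag MTime Size
```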
2) Would it help to use "Cache-Control: max-age=2592000, public" instead of "Cache-Control: max-age=2592000"? The public directive is defined as: "marks the response as cacheable, even if it would normally be uncacheable. For instance, if your pages are authenticated, the public directive makes them cacheable." I'm not sure whether the Wikipedia cookie is treated as authentication for the purposes of this definition, but if it is, caching the site-wide CSS or JS seems unlikely to hurt (since it really is "public") - though caching the user-specific CSS or JS would obviously be bad.
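If adding "public" turns out to help, and mod_headers is available on the backends, something like the following sketch could scope it to just the shared skin files so user-specific CSS/JS is left alone (the path pattern here is only my assumption, based on the URL in the transcript above):

```apache
# Hedged sketch: mark only the site-wide skin assets as publicly
# cacheable; user-specific CSS/JS served from other paths is
# unaffected. Requires mod_headers.
<LocationMatch "^/skins-1\.5/">
    Header set Cache-Control "max-age=2592000, public"
</LocationMatch>
```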
It should be possible to serve these files compressed through Apache with mod_gzip set up, which should squish them by roughly two-thirds.
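As a rough sanity check of that two-thirds figure, gzipping a synthetic stylesheet-like file shows how much repetitive text shrinks (the selector line below is made up; real CSS is similarly repetitive, so actual ratios will vary):

```shell
# Build a synthetic CSS-like file and compare raw vs gzipped size.
yes '.portlet h5 { background: transparent; padding: 0 1em; }' | head -n 500 > /tmp/sample.css
orig=$(wc -c < /tmp/sample.css)
gzipped=$(gzip -c /tmp/sample.css | wc -c)
echo "original: $orig bytes, gzipped: $gzipped bytes"
```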
Last time I looked, mod_gzip seemed to be losing favour somewhat; the new "in" compression method for Apache 2 seems to be mod_deflate ( http://httpd.apache.org/docs/2.0/mod/mod_deflate.html ), partly because it's a bundled Apache module rather than a third-party one. Its compression was a few percentage points less efficient than mod_gzip's, but the suggestion was that it caused less CPU load. It was about a year ago that I looked at this stuff, though, so the state of play may have changed since.
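In case it's useful, a minimal mod_deflate setup for Apache 2.0 might look something like this (a sketch assuming the module is loaded; the MIME types listed are the usual CSS/JS ones, and images are deliberately left out since they're already compressed):

```apache
# Compress only text-ish responses; JPEG/PNG etc. gain nothing
# from DEFLATE. Assumes mod_deflate is compiled in or loaded
# via LoadModule.
AddOutputFilterByType DEFLATE text/html text/css application/x-javascript
```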
All the best, Nick.