Neil Harris wrote:
I'm not suggesting the log files should be polled in real-time: that would be silly.
That was my understanding also. I mentioned polling in the context of constructing log files, not reading them.
Rather, the URL invalidation file needs to be pulled down only once per batch pre-fetch session, a total of perhaps 6 Mbytes per day for each client site, assuming 60 chars per log entry and 100,000 edits a day. This assumes no compression: this file should gzip extremely efficiently due to the strong repetition in both URLs and timestamp strings, so the file might be only 2 or 3 Mbytes long if downloaded with gzip on-the-fly compression.
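The size estimate above can be checked with a few lines of Python. This is only a sketch: the 60-character entry size and the 100,000 edits per day are the figures from the message, while the log line format, the date, and the article names are invented for illustration. The synthetic sample is far more regular than a real invalidation log, so its compression ratio says nothing reliable about the 2 to 3 Mbyte guess.

```python
import gzip

ENTRY_CHARS = 60          # assumed average size of one log entry (from the message)
EDITS_PER_DAY = 100_000   # assumed number of edits per day (from the message)


def daily_log_bytes(entry_chars=ENTRY_CHARS, edits_per_day=EDITS_PER_DAY):
    """Uncompressed size of one day's invalidation log, in bytes."""
    return entry_chars * edits_per_day


def synthetic_log(entries):
    """Build a made-up log: one 'timestamp URL' line per edit (hypothetical format)."""
    lines = []
    for i in range(entries):
        stamp = f"2004-03-01T12:{i % 60:02d}:{(i * 7) % 60:02d}Z"
        url = f"http://en.wikipedia.org/wiki/Article_{i % 5000}"
        lines.append(f"{stamp} {url}\n")
    return "".join(lines).encode("ascii")


raw = synthetic_log(10_000)
packed = gzip.compress(raw)

print(daily_log_bytes())       # 60 * 100,000 bytes, i.e. about 6 Mbytes
print(len(raw), len(packed))   # raw size vs. gzip size of the synthetic sample
```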
For a remote site, Wikipedia traffic will only form a small proportion of overall traffic. A cache issuing an HTTP HEAD request to check freshness will receive only a tiny fraction of the number of bytes that an HTTP GET would return: if the data is stale, a full GET will be required, regardless of whether the cache invalidation is done on-demand at page fetch time or by real-time streaming.
Caches use ICP, not HTTP HEAD. The only clients that use HEAD are link checkers. Browsers use a conditional GET, sending the Last-Modified date of their copy back in an If-Modified-Since header.
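For reference, the browser-side check works roughly as follows: the browser repeats the Last-Modified date of its stored copy in an If-Modified-Since request header, and the server answers 304 with no body if the page has not changed since then, or 200 with the full page otherwise. A minimal sketch of the server's decision, where the function name and the example dates are made up and everything except the date comparison is ignored:

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime


def status_for(last_modified, if_modified_since=None):
    """Return the HTTP status a server would send for a (conditional) GET.

    last_modified      -- timezone-aware datetime of the page's last edit
    if_modified_since  -- raw If-Modified-Since header value, or None
    """
    if if_modified_since is not None:
        client_copy = parsedate_to_datetime(if_modified_since)
        # HTTP dates have one-second resolution, so drop microseconds.
        if last_modified.replace(microsecond=0) <= client_copy:
            return 304  # Not Modified: headers only, no page body
    return 200          # full response with the page body


edited = datetime(2004, 3, 1, 12, 0, 0, tzinfo=timezone.utc)
header = format_datetime(edited, usegmt=True)  # value the browser echoes back

print(status_for(edited, header))  # browser copy is current
print(status_for(edited))          # plain GET, no validator sent
```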
The advantage of the streaming approach is (as I understand it)
- to eliminate the need for HEAD requests
No, intermediate caches can't be relied on to do that kind of revalidation, only browsers can. Wikipedia sends "Cache-Control: private,must-revalidate", which disables intermediate caches entirely. The point of the streaming approach is to allow an intermediate cache at all; that's why it was developed concurrently with our initial deployment of Squid.
Maybe implementing the kind of revalidation you're talking about would be a useful step.
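To illustrate why that header shuts out intermediate caches: a shared cache must not store a response marked "private" (or "no-store"), so every request passes through to the origin. A rough sketch of the check, with a made-up function name and deliberately ignoring every other rule a real cache applies:

```python
def shared_cache_may_store(cache_control):
    """Rough check: may a shared (intermediate) cache store this response?

    Looks only at the 'private' and 'no-store' directives of a
    Cache-Control header value.
    """
    directives = {
        part.strip().split("=", 1)[0].lower()
        for part in cache_control.split(",")
        if part.strip()
    }
    return not ({"private", "no-store"} & directives)


print(shared_cache_may_store("private,must-revalidate"))  # the header quoted above
print(shared_cache_may_store("public, max-age=60"))       # hypothetical alternative
```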
- to push some of the GET requests into the off-peak time, which is a win provided the page is not touched between the off-peak fetch and a user's on-peak access to the same page (but a loss if the page is edited between then and the user access, as it simply wastes off-peak bandwidth to no useful effect).
[...]
That's ancillary, and it hasn't been developed yet.
-- Tim Starling