Tim Starling wrote:
The pre-fetch component is a batch operation, but the cache clear isn't. When a user edits a page, we expect everyone in the world to be able to retrieve it within a second or two. That's why we already have apaches in two countries pumping out UDP packets which are routed by various means to all our squid servers. When someone makes an edit, the worldwide cache of that page is instantly purged, and the main point of my proposal is an automated method for ISP proxies to be part of that.
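In rough outline, that fan-out amounts to something like the following sketch (Python; the multicast group, port and bare-URL payload are placeholders here, since the real packets are binary HTCP CLR messages rather than plain URLs):

    import socket

    PURGE_GROUP = "239.128.0.112"   # placeholder multicast group
    PURGE_PORT = 4827               # placeholder port

    def send_purge(url):
        """Tell every listening cache to discard its copy of `url`."""
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, 8)
        sock.sendto(url.encode("utf-8"), (PURGE_GROUP, PURGE_PORT))
        sock.close()

    # called once per page save, e.g.:
    # send_purge("http://en.wikipedia.org/wiki/Main_Page")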
We could convert the CLR stream to ASCII text, but the software to do that could equally run on every individual squid, since they all have access to that stream. Either way, we need to put a flag in the CLR packets indicating whether the item should be pre-fetched or not.
Polling the recentchanges table to construct these ASCII files isn't really an option; we tried that with the RC->IRC bots and it turned out to be too resource-intensive, which is why they also use a UDP stream these days. We could add a third UDP stream for this purpose, or create a daemon and have TCP notification, but it seems easier to me to just modify one of the existing ones slightly, by adding a flag. The REASON field is 4 bits wide and only two codes are defined, so there's plenty of room for us to suggest our own codes. Otherwise we can just stake a claim in the RESERVED section.
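A converter along those lines, running on each squid (or centrally), might look roughly like the sketch below. The reason codes, the log location and the payload format are all hypothetical: real CLR packets are binary HTCP, but for illustration the flag is carried here as a numeric field prepended to the URL in the datagram payload.

    import socket
    import time

    REASON_PURGE = 4       # hypothetical code: purge only
    REASON_PREFETCH = 5    # hypothetical code: purge and mark for batch pre-fetch

    def run_converter(logdir="/var/log/wp-invalidation"):
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        sock.bind(("", 4827))                        # same placeholder port as above
        membership = socket.inet_aton("239.128.0.112") + socket.inet_aton("0.0.0.0")
        sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, membership)
        while True:
            payload, _addr = sock.recvfrom(2048)
            reason, url = payload.decode("utf-8").split("\t", 1)
            if int(reason) != REASON_PREFETCH:
                continue                             # not flagged for pre-fetching
            stamp = time.strftime("%Y%m%d%H%M%S", time.gmtime())
            with open("%s/%s.log" % (logdir, stamp[:8]), "a") as f:
                f.write("%s %s\n" % (stamp, url))    # one short ASCII line per purge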
-- Tim Starling
I'm not suggesting the log files should be polled in real-time: that would be silly. Rather, the URL invalidation file needs to be pulled down only once per batch pre-fetch session, a total of perhaps 6 Mbytes per day for each client site, assuming 60 chars per log entry and 100,000 edits a day. That figure assumes no compression: the file should gzip extremely efficiently due to the strong repetition in both the URLs and the timestamp strings, so it might be only 2 or 3 Mbytes if downloaded with gzip on-the-fly compression.
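A quick sanity check of those figures with a synthetic log (the URL pattern and the number of distinct titles are invented, so the compression ratio is only indicative):

    import gzip
    import random

    entries = []
    for i in range(100_000):                          # one day's worth of edits
        title = "Page_%05d" % random.randrange(20_000)
        t = i % 86400
        stamp = "20060101%02d%02d%02d" % (t // 3600, (t // 60) % 60, t % 60)
        entries.append("%s http://en.wikipedia.org/wiki/%s\n" % (stamp, title))

    raw = "".join(entries).encode("ascii")
    packed = gzip.compress(raw)
    print("raw:    %.1f MB" % (len(raw) / 1e6))       # roughly the 6 Mbyte estimate
    print("gzip'd: %.1f MB" % (len(packed) / 1e6))    # a good deal smaller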
For a remote site, Wikipedia traffic will only form a small proportion of overall traffic. A cache issuing an HTTP HEAD request to check freshness transfers only a tiny fraction of the bytes that an HTTP GET would; and if the data is stale, a full GET will be required anyway, regardless of whether the cache invalidation is done on-demand at page fetch time or by real-time streaming.
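For comparison, the freshness check itself is just a headers-only round trip, something like this (hostname, path and header choice are illustrative only):

    import http.client

    def is_fresh(cached_last_modified, host="en.wikipedia.org", path="/wiki/Main_Page"):
        """HEAD the page and compare Last-Modified against the cached copy."""
        conn = http.client.HTTPConnection(host)
        conn.request("HEAD", path)
        resp = conn.getresponse()
        fresh = resp.getheader("Last-Modified") == cached_last_modified
        conn.close()
        return fresh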
The advantages of the streaming approach are (as I understand it):
* to eliminate the need for HEAD requests;
* to push some of the GET requests into off-peak time, which is a win provided the page is not touched between the off-peak fetch and a user's on-peak access to the same page (but a loss if the page is edited between then and the user's access, as it simply wastes off-peak bandwidth to no useful effect).
For the Wikipedia cluster, which spends 100% of its time serving Wikipedia content, multicast HTCP is clearly the best solution to the problem.
For another site, the question will be whether the expected bandwidth savings from the streaming system's advantages will outweigh the cost, complexity and bandwidth overhead of installing special software and subscribing to the HTCP stream at all times of day.
It would make sense to do some modelling of the cache hit/miss patterns of off-peak prefetching before going ahead and implementing anything. For a simple model, I think it would be reasonable to expect a Zipf distribution of access rates for individual pages on both the read and edit sides of the system, to consider the gains/losses for pages updated in a Poisson fashion at various rates, and to look at what proportion of the traffic is accounted for by each class of pages.
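As a starting point, a crude Monte Carlo along those lines might look like the sketch below; the page count, rates, Zipf exponent and peak-window length are all placeholder assumptions, and a pre-fetch is counted as useful only if the page is read during the following peak window before it is edited again.

    import random

    def simulate(pages=200_000, reads_per_day=1_000_000, edits_per_day=100_000,
                 peak_days=2.0 / 3.0, zipf_s=1.0, seed=1):
        rng = random.Random(seed)
        # Zipf-like popularity shared by the read and edit sides.
        weights = [1.0 / (rank + 1) ** zipf_s for rank in range(pages)]
        total = sum(weights)
        # Pages edited (hence purged and pre-fetched) during one off-peak batch.
        edited = set(rng.choices(range(pages), weights=weights, k=edits_per_day))
        wins = losses = 0
        for page in edited:
            read_rate = reads_per_day * weights[page] / total   # expected reads/day
            edit_rate = edits_per_day * weights[page] / total   # expected edits/day
            t_read = rng.expovariate(read_rate)                 # time to next read (days)
            t_edit = rng.expovariate(edit_rate)                 # time to next edit (days)
            if t_read < t_edit and t_read < peak_days:
                wins += 1      # pre-fetched copy served a reader before the next edit
            else:
                losses += 1    # edited again first, or never read in-peak: wasted fetch
        print("useful pre-fetches: %.1f%%" % (100.0 * wins / (wins + losses)))

    simulate()

Varying the Zipf exponent, the edit rate and the length of the peak window would give a first feel for how much of the pre-fetched traffic actually ends up being served.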
-- Neil