Tim Starling wrote:
The pre-fetch component is a batch operation, but the cache clear isn't. When a user edits a page, we expect everyone in the world to be able to retrieve it within a second or two. That's why we already have apaches in two countries pumping out UDP packets which are routed by various means to all our squid servers. When someone makes an edit, the worldwide cache of that page is instantly purged, and the main point of my proposal is an automated method for ISP proxies to be part of that.
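In rough outline, that fan-out amounts to something like the following sketch (Python; the multicast group, port and bare-URL payload are placeholders here, since the real packets are binary HTCP CLR messages rather than plain URLs):

    import socket

    PURGE_GROUP = "239.128.0.112"   # placeholder multicast group
    PURGE_PORT = 4827               # placeholder port

    def send_purge(url):
        """Tell every listening cache to discard its copy of `url`."""
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, 8)
        sock.sendto(url.encode("utf-8"), (PURGE_GROUP, PURGE_PORT))
        sock.close()

    # called once per page save, e.g.:
    # send_purge("http://en.wikipedia.org/wiki/Main_Page")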
We could convert the CLR stream to ASCII text, but the software to do that could equally run on every individual squid, since they all have access to that stream. Either way, we need to put a flag in the CLR packets indicating whether the item should be pre-fetched or not.
Polling the recentchanges table to construct these ASCII files isn't really an option; we tried that with the RC->IRC bots and it turned out to be too resource-intensive, which is why they also use a UDP stream these days. We could add a third UDP stream for this purpose, or create a daemon and have TCP notification, but it seems easier to me to just modify one of the existing ones slightly, by adding a flag. The REASON field is 4 bits wide and only two codes are defined, so there's plenty of room for us to suggest our own codes. Otherwise we can just stake a claim in the RESERVED section.
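A converter along those lines, running on each squid (or centrally), might look roughly like the sketch below. The reason codes, the log location and the payload format are all hypothetical: real CLR packets are binary HTCP, but for illustration the flag is carried here as a numeric field prepended to the URL in the datagram payload.

    import socket
    import time

    REASON_PURGE = 4       # hypothetical code: purge only
    REASON_PREFETCH = 5    # hypothetical code: purge and mark for batch pre-fetch

    def run_converter(logdir="/var/log/wp-invalidation"):
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        sock.bind(("", 4827))                        # same placeholder port as above
        membership = socket.inet_aton("239.128.0.112") + socket.inet_aton("0.0.0.0")
        sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, membership)
        while True:
            payload, _addr = sock.recvfrom(2048)
            reason, url = payload.decode("utf-8").split("\t", 1)
            if int(reason) != REASON_PREFETCH:
                continue                             # not flagged for pre-fetching
            stamp = time.strftime("%Y%m%d%H%M%S", time.gmtime())
            with open("%s/%s.log" % (logdir, stamp[:8]), "a") as f:
                f.write("%s %s\n" % (stamp, url))    # one short ASCII line per purge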
-- Tim Starling
I'm not suggesting the log files should be polled in real-time: that would be silly. Rather, the URL invalidation file needs to be pulled down only once per batch pre-fetch session, a total of perhaps 6 Mbytes per day for each client site, assuming 60 chars per log entry and 100,000 edits a day. That figure assumes no compression: the file should gzip extremely efficiently due to the strong repetition in both the URLs and the timestamp strings, so it might be only 2 or 3 Mbytes if downloaded with gzip on-the-fly compression.
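A quick sanity check of those figures with a synthetic log (the URL pattern and the number of distinct titles are invented, so the compression ratio is only indicative):

    import gzip
    import random

    entries = []
    for i in range(100_000):                          # one day's worth of edits
        title = "Page_%05d" % random.randrange(20_000)
        t = i % 86400
        stamp = "20060101%02d%02d%02d" % (t // 3600, (t // 60) % 60, t % 60)
        entries.append("%s http://en.wikipedia.org/wiki/%s\n" % (stamp, title))

    raw = "".join(entries).encode("ascii")
    packed = gzip.compress(raw)
    print("raw:    %.1f MB" % (len(raw) / 1e6))       # roughly the 6 Mbyte estimate
    print("gzip'd: %.1f MB" % (len(packed) / 1e6))    # a good deal smaller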
For a remote site, Wikipedia traffic will only form a small proportion of overall traffic. A cache issuing an HTTP HEAD request to check freshness transfers only a tiny fraction of the bytes that an HTTP GET would; and if the data is stale, a full GET will be required anyway, regardless of whether the cache invalidation is done on-demand at page fetch time or by real-time streaming.
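For comparison, the freshness check itself is just a headers-only round trip, something like this (hostname, path and header choice are illustrative only):

    import http.client

    def is_fresh(cached_last_modified, host="en.wikipedia.org", path="/wiki/Main_Page"):
        """HEAD the page and compare Last-Modified against the cached copy."""
        conn = http.client.HTTPConnection(host)
        conn.request("HEAD", path)
        resp = conn.getresponse()
        fresh = resp.getheader("Last-Modified") == cached_last_modified
        conn.close()
        return fresh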
The advantages of the streaming approach are (as I understand it):
* to eliminate the need for HEAD requests;
* to push some of the GET requests into off-peak time, which is a win provided the page is not touched between the off-peak fetch and a user's on-peak access to the same page (but a loss if the page is edited between then and the user's access, as it simply wastes off-peak bandwidth to no useful effect).
For the Wikipedia cluster, which spends 100% of its time serving Wikipedia content, multicast HTCP is clearly the best solution to the problem.
For another site, the question will be whether the expected bandwidth savings from the streaming system's advantages will outweigh the cost, complexity and bandwidth overhead of installing special software and subscribing to the HTCP stream at all times of day.
It would make sense to do some modelling of the cache hit/miss patterns of off-peak prefetching before going ahead and implementing anything. For a simple model, I think it would be reasonable to expect a Zipf distribution of access rates for individual pages on both the read and edit sides of the system, to consider the gains/losses for pages updated in a Poisson fashion at various rates, and to look at what proportion of the traffic is accounted for by each class of pages.
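As a starting point, a crude Monte Carlo along those lines might look like the sketch below; the page count, rates, Zipf exponent and peak-window length are all placeholder assumptions, and a pre-fetch is counted as useful only if the page is read during the following peak window before it is edited again.

    import random

    def simulate(pages=200_000, reads_per_day=1_000_000, edits_per_day=100_000,
                 peak_days=2.0 / 3.0, zipf_s=1.0, seed=1):
        rng = random.Random(seed)
        # Zipf-like popularity shared by the read and edit sides.
        weights = [1.0 / (rank + 1) ** zipf_s for rank in range(pages)]
        total = sum(weights)
        # Pages edited (hence purged and pre-fetched) during one off-peak batch.
        edited = set(rng.choices(range(pages), weights=weights, k=edits_per_day))
        wins = losses = 0
        for page in edited:
            read_rate = reads_per_day * weights[page] / total   # expected reads/day
            edit_rate = edits_per_day * weights[page] / total   # expected edits/day
            t_read = rng.expovariate(read_rate)                 # time to next read (days)
            t_edit = rng.expovariate(edit_rate)                 # time to next edit (days)
            if t_read < t_edit and t_read < peak_days:
                wins += 1      # pre-fetched copy served a reader before the next edit
            else:
                losses += 1    # edited again first, or never read in-peak: wasted fetch
        print("useful pre-fetches: %.1f%%" % (100.0 * wins / (wins + losses)))

    simulate()

Varying the Zipf exponent, the edit rate and the length of the peak window would give a first feel for how much of the pre-fetched traffic actually ends up being served.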
-- Neil