My idea for faster, cheaper access to Wikipedia in poorer countries:
http://meta.wikimedia.org/wiki/Reducing_transit_requirements
It requires patching squid and extending the HTCP protocol. The latter should be no big deal; the protocol is only in the "experimental" phase, after all.
-- Tim Starling
Tim Starling wrote:
My idea for faster, cheaper access to Wikipedia in poorer countries:
http://meta.wikimedia.org/wiki/Reducing_transit_requirements
It requires patching squid and extending the HTCP protocol. The latter should be no big deal; the protocol is only in the "experimental" phase, after all.
-- Tim Starling
Interesting. I don't know of anyone charging different rates for day/night IP traffic (let me know if anyone on the list knows better). However, this _is_ the indirect effect of existing pricing policies.
Most first-world wholesale IP bandwidth is (AFAIK) charged at 95th-percentile rates: that is, you pay for the bandwidth level below which your short-timescale traffic falls 95% of the time. In general, the provider gives you a bigger physical pipe than you currently need, for the same reason that credit card companies give you ever-increasing credit limits -- in the hope that your usage will grow progressively to fill it.
If you add a small increment to your traffic at your daily peak, you will push up your 95% figure. If you add it at your off-peak time, it won't. Thus, for anyone on a 95% deal, small increments in off-peak traffic are effectively "free".
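To make that concrete, here is a toy Python sketch (sampling interval and traffic figures are entirely invented) showing why an off-peak increment leaves the billed 95th-percentile figure alone, while the same increment at peak time raises it:

    # Toy illustration of 95th-percentile billing (all numbers invented).
    # Traffic is sampled, say every 5 minutes; the provider bills on the
    # sample below which 95% of measurements fall.

    def percentile_95(samples_mbps):
        """Return the 95th-percentile sample (simple nearest-rank method)."""
        ordered = sorted(samples_mbps)
        rank = int(0.95 * len(ordered)) - 1   # nearest-rank, 0-based
        return ordered[max(rank, 0)]

    # 288 five-minute samples for one day: low at night, high during the day.
    night = [10.0] * 144          # ~10 Mbit/s off-peak
    day = [80.0] * 144            # ~80 Mbit/s on-peak
    baseline = night + day

    extra = 5.0                   # a small added load, e.g. batch pre-fetching

    off_peak = [s + extra for s in night] + day    # extra traffic at night
    on_peak = night + [s + extra for s in day]     # extra traffic at the peak

    print(percentile_95(baseline))   # 80.0 -> what you pay for today
    print(percentile_95(off_peak))   # 80.0 -> off-peak increment is "free"
    print(percentile_95(on_peak))    # 85.0 -> peak increment raises the bill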
-- Neil
Tim Starling wrote:
My idea for faster, cheaper access to Wikipedia in poorer countries:
http://meta.wikimedia.org/wiki/Reducing_transit_requirements
It requires patching squid and extending the HTCP protocol. The latter should be no big deal, the protocol is only in the "experimental" phase after all.
On reflection, inventing new protocols for taking a real-time feed is probably unnecessary for what will be a batch operation.
How about simply making plain-ASCII logfiles available, where each line consists of
<timestamp> <url>\n
which is grown in real-time by appending to it?
People wanting to freshen their caches can then download it, trim off all entries previous to the last time they fetched, uniq the file to get only a single copy of each URL, and then run a script to wget all the URLs from inside their cache during off-peak hours, thus freshening it.
Very little software, a twenty-line perl script, no new protocols, and (I hope) achieves the same effect.
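Something along these lines, sketched in Python rather than perl purely for illustration -- the log URL, proxy address and timestamp format are all invented:

    #!/usr/bin/env python3
    # Rough sketch of the batch "freshen" pass described above. The log
    # location, the proxy address and the assumption of Unix-epoch
    # timestamps are placeholders, not anything that exists today.
    import time
    import urllib.request

    LOG_URL = "https://example.org/invalidation.log"   # hypothetical log file
    PROXY = {"http": "http://localhost:3128"}          # the site's own squid
    LAST_RUN = time.time() - 24 * 3600                 # e.g. read from a state file

    # 1. Pull down the invalidation log: "<timestamp> <url>" per line.
    log_text = urllib.request.urlopen(LOG_URL).read().decode("utf-8", "replace")

    # 2. Trim entries older than the last run and keep one copy of each URL.
    urls = []
    seen = set()
    for line in log_text.splitlines():
        try:
            stamp, url = line.split(None, 1)
        except ValueError:
            continue
        if float(stamp) < LAST_RUN or url in seen:
            continue
        seen.add(url)
        urls.append(url)

    # 3. Fetch each URL through the local cache during off-peak hours,
    #    which pulls a fresh copy into the cache.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler(PROXY))
    for url in urls:
        try:
            opener.open(url, timeout=30).read()
        except OSError:
            pass  # a failed pre-fetch just means an ordinary miss later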
-- Neil
Neil Harris wrote:
On reflection, inventing new protocols for taking a real-time feed is probably unnecessary for what will be a batch operation.
How about simply making plain-ASCII logfiles available, where each line consists of
<timestamp> <url>\n
which is grown in real-time by appending to it?
People wanting to freshen their caches can then download it, trim off all entries previous to the last time they fetched, uniq the file to get only a single copy of each URL, and then run a script to wget all the URLs from inside their cache during off-peak hours, thus freshening it.
Very little software, a twenty-line perl script, no new protocols, and (I hope) achieves the same effect.
The pre-fetch component is a batch operation, but the cache clear isn't. When a user edits a page, we expect everyone in the world to be able to retrieve it within a second or two. That's why we already have apaches in two countries pumping out UDP packets which are routed by various means to all our squid servers. When someone makes an edit, the worldwide cache of that page is instantly purged, and the main point of my proposal is an automated method for ISP proxies to be part of that.
We could convert the CLR stream to ASCII text, but the software to do that could equally run on every individual squid, since they all have access to that stream. Either way, we need to put a flag in the CLR packets indicating whether the item should be pre-fetched or not.
Polling the recentchanges table to construct these ASCII files isn't really an option, we tried that with the RC->IRC bots and it turned out to be too resource-intensive, which is why they also use a UDP stream these days. We could add a third UDP stream for this purpose, or create a daemon and have TCP notification, but it seems easier to me to just modify one of the existing ones slightly, by adding a flag. The REASON field has a 4-bit width, and only two codes are defined, so there's plenty of room for us to suggest our own codes. Otherwise we can just stake a claim in the RESERVED section.
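The real thing would be a Squid patch speaking HTCP, as above, but the flag idea itself is simple. Here is a toy Python listener that assumes a simplified ASCII datagram of the form "<reason> <url>" standing in for the CLR op-data -- this is NOT the real HTCP binary encoding, and the reason codes below are hypothetical:

    # Toy purge/pre-fetch listener. Assumes a simplified ASCII datagram
    # "<reason> <url>", where the reason code plays the role of the 4-bit
    # REASON field in an HTCP CLR message. Codes are invented for illustration.
    import socket

    REASON_PURGE = 0            # hypothetical: clear the cached copy only
    REASON_PURGE_PREFETCH = 2   # hypothetical: clear it, then re-fetch off-peak

    prefetch_queue = []         # URLs to re-fetch during the off-peak window

    def purge_from_local_cache(url):
        # Placeholder: a real implementation would tell the local cache
        # (e.g. via an HTTP PURGE request, if enabled) to drop this URL.
        pass

    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("0.0.0.0", 4827))          # 4827 is the registered HTCP port

    while True:
        datagram, _addr = sock.recvfrom(65535)
        try:
            reason, url = datagram.decode("utf-8").split(None, 1)
            reason = int(reason)
        except ValueError:
            continue
        purge_from_local_cache(url)       # every CLR clears the cached copy
        if reason == REASON_PURGE_PREFETCH:
            prefetch_queue.append(url)    # a batch job fetches these later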
-- Tim Starling
Tim Starling wrote:
The pre-fetch component is a batch operation, but the cache clear isn't. When a user edits a page, we expect everyone in the world to be able to retrieve it within a second or two. That's why we already have apaches in two countries pumping out UDP packets which are routed by various means to all our squid servers. When someone makes an edit, the worldwide cache of that page is instantly purged, and the main point of my proposal is an automated method for ISP proxies to be part of that.
We could convert the CLR stream to ASCII text, but the software to do that could equally run on every individual squid, since they all have access to that stream. Either way, we need to put a flag in the CLR packets indicating whether the item should be pre-fetched or not.
Polling the recentchanges table to construct these ASCII files isn't really an option, we tried that with the RC->IRC bots and it turned out to be too resource-intensive, which is why they also use a UDP stream these days. We could add a third UDP stream for this purpose, or create a daemon and have TCP notification, but it seems easier to me to just modify one of the existing ones slightly, by adding a flag. The REASON field has a 4-bit width, and only two codes are defined, so there's plenty of room for us to suggest our own codes. Otherwise we can just stake a claim in the RESERVED section.
-- Tim Starling
I'm not suggesting the log files should be polled in real-time: that would be silly. Rather, the URL invalidation file needs to be pulled down only once per batch pre-fetch session, a total of perhaps 6 Mbytes per day for each client site, assuming 60 chars per log entry and 100,000 edits a day. This assumes no compression: this file should gzip extremely efficiently due to the strong repetition in both URLs and timestamp strings, so the file might be only 2 or 3 Mbytes long if downloaded with gzip on-the-fly compression.
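A quick sanity check of those figures, in Python:

    # Back-of-envelope size of the uncompressed daily log (figures as above).
    edits_per_day = 100_000
    bytes_per_line = 60           # rough: timestamp + URL + newline
    print(edits_per_day * bytes_per_line / 1e6, "Mbytes/day")   # 6.0 Mbytes/day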
For a remote site, Wikipedia traffic will only form a small proportion of overall traffic. A cache issuing an HTTP HEAD request to check freshness will return a tiny fraction of the number of bytes that an HTTP GET will: if the data is stale, a full GET will be required, regardless of whether the cache invalidation is done on-demand at page fetch time or by real-time streaming.
The advantage of the streaming approach is (as I understand it):
- to eliminate the need for HEAD requests
- to push some of the GET requests into the off-peak time, which is a win providing the page is not touched between the off-peak fetch and a user's on-peak access to the same page (but a loss if the page is edited between then and the user access, as it simply wastes off-peak bandwidth to no useful effect).
For the Wikipedia cluster, which spends 100% of its time serving Wikipedia content, the multicast HTCP approach is clearly the best solution to the problem.
For another site, the question will be whether the expected cost and bandwidth savings of the streaming system outweigh the cost, complexity and bandwidth overhead of installing special software and subscribing to the HTCP stream at all times of day.
It would make sense to do some modelling of the cache hit/miss patterns of off-peak prefetching before going ahead and implementing anything. For a simple model, I think it would be reasonable to assume a Zipf distribution of access rates for individual pages on both the read and edit sides of the system, to treat pages as being updated in a Poisson fashion at various rates, and to consider the gains/losses and what proportion of the traffic each class of page accounts for.
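A very rough first cut at that kind of model, in Python with numpy -- every parameter here is a guess, purely to show the shape of the calculation:

    # Page popularity follows a Zipf law, edits arrive as a Poisson process,
    # and we ask how often an off-peak pre-fetch is wasted because the page
    # is edited again (or never read) before its next on-peak access.
    import numpy as np

    rng = np.random.default_rng(0)

    N_PAGES = 100_000
    READS_PER_DAY = 1_000_000
    EDITS_PER_DAY = 100_000
    ON_PEAK_HOURS = 16                    # pre-fetching happens in the other 8

    # Zipf-like popularity shared by reads and edits (exponent ~1).
    rank = np.arange(1, N_PAGES + 1)
    popularity = 1.0 / rank
    popularity /= popularity.sum()

    reads = rng.multinomial(READS_PER_DAY, popularity)   # on-peak reads per page
    edit_rate = EDITS_PER_DAY * popularity               # expected edits per page/day

    # Probability a pre-fetched page is still fresh when it is next read,
    # if edits on it are Poisson with the per-page rate above.
    hours_until_read = rng.uniform(0, ON_PEAK_HOURS, size=N_PAGES)
    p_survives = np.exp(-edit_rate * hours_until_read / 24.0)

    useful = (reads > 0) * p_survives     # read on-peak and still fresh
    wasted = 1.0 - useful                 # edited again first, or never read
    print("fraction of useful pre-fetches:", useful.sum() / N_PAGES)
    print("fraction of wasted pre-fetches:", wasted.sum() / N_PAGES)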
-- Neil
Neil Harris wrote:
I'm not suggesting the log files should be polled in real-time: that would be silly.
That was my understanding also. I mentioned polling in the context of constructing log files, not reading them.
Rather, the URL invalidation file needs to be pulled down only once per batch pre-fetch session, a total of perhaps 6 Mbytes per day for each client site, assuming 60 chars per log entry and 100,000 edits a day. This assumes no compression: this file should gzip extremely efficiently due to the strong repetition in both URLs and timestamp strings, so the file might be only 2 or 3 Mbytes long if downloaded with gzip on-the-fly compression.
For a remote site, Wikipedia traffic will only form a small proportion of overall traffic. A cache issuing an HTTP HEAD request to check freshness will return a tiny fraction of the number of bytes that an HTTP GET will: if the data is stale, a full GET will be required, regardless of whether the cache invalidation is done on-demand at page fetch time or by real-time streaming.
Caches use ICP, not HTTP HEAD. The only clients that use HEAD are link checkers. Browsers revalidate with a conditional GET, sending the Last-Modified date back in an If-Modified-Since header.
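For reference, that browser-style revalidation looks roughly like this on the wire -- a small Python sketch, with the URL and date as placeholders:

    # Conditional GET: the server sends 304 Not Modified (headers only) when
    # the cached copy is still fresh, otherwise the full body with 200.
    import urllib.request
    import urllib.error

    req = urllib.request.Request(
        "https://en.wikipedia.org/wiki/Main_Page",
        headers={"If-Modified-Since": "Sat, 01 Jan 2005 00:00:00 GMT"},
    )
    try:
        resp = urllib.request.urlopen(req)
        print(resp.status, len(resp.read()), "bytes")   # 200: full body sent
    except urllib.error.HTTPError as err:
        print(err.code)                                 # 304: nothing to transfer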
The advantage of the streaming approach is (as I understand it)
- to eliminate the need for HEAD requests
No, intermediate caches can't be relied on to do that kind of revalidation, only browsers can. Wikipedia sends "Cache-Control: private,must-revalidate" which disables intermediate caches entirely. The point of the streaming approach is to allow an intermediate cache at all, that's why it was developed concurrently with our initial deployment of squid.
Maybe implementing the kind of revalidation you're talking about would be a useful step.
- to push some of the GET requests into the off-peak time, which is a
win providing the page is not touched between the off-peak fetch and a user's on-peak access to the same page (but a loss if the page is edited between then and the user access, as it simply wastes off-peak bandwidth to no useful effect).
[...]
That's ancillary, and it hasn't been developed yet.
-- Tim Starling
Tim Starling wrote:
Neil Harris wrote:
The advantage of the streaming approach is (as I understand it)
- to eliminate the need for HEAD requests
No, intermediate caches can't be relied on to do that kind of revalidation, only browsers can. Wikipedia sends "Cache-Control: private,must-revalidate" which disables intermediate caches entirely. The point of the streaming approach is to allow an intermediate cache at all, that's why it was developed concurrently with our initial deployment of squid.
Maybe implementing the kind of revalidation you're talking about would be a useful step.
Ah. There's more to this than I had realised -- but I can see that it makes sense, since Wikipedia may serve up different content to different users for the same URL.
How about _eliminating_ this behaviour by having two possible URLs for each page, "/wiki/X" and "/dynwiki/X", both serving exactly the same content as a /wiki/ URL serves now?
/wiki/ URLs would be for general readers, and marked as "public, must-revalidate", and would serve the same content to every user.
/dynwiki/ URLs would be for readers who may receive content that varies from the normal appearance sent to anons (anons with messages, and all logged-in users), and would be marked as "private, must-revalidate".
Both classes of URL would be rewritten to exactly the same internal URLs, and call the same code, as at present: the difference is that /dynwiki/ pages would be non-cacheable versions of the same content. Effectively, the difference between the two URLs is only a hint to any caches along the way as to whether the page is cacheable.
When we get the conditional GET which every hit will generate, we can work out which page to serve based on the message flag for anons, and the presence of user cookies for logged-in users. If you access a /wiki/ page and you should be getting /dynwiki/ content, you will be redirected to the corresponding /dynwiki/ URL; similarly, if you are accessing dynamic content but should be getting the static content, you will be redirected the other way. All of the links on a page would belong to the same base URL as the transmitted page, so the dynamic state would be "sticky", and there would not need to be many redirects: generally, only one for each change of state from dynamic to static or vice versa for a given user.
Web crawlers, and the rest of the world, will generally see only the /wiki/ URLs and content. Only logged-in users and anons with messages would see the /dynwiki/ content.
If this works as I imagine, it would have the effect of rendering the whole of Wikipedia cacheable for the (I imagine) 90%+ of readers who are not logged in. Conditional GETs would still be needed for every page, but the bulk of the data would not need to be shifted whenever there is a hit, which could substantially reduce the average number of bytes shifted per page hit.
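To show what I mean, here is a bare-bones sketch of the decision logic as a Python WSGI app -- the cookie test and redirect rules are my own assumptions, not how MediaWiki actually decides this:

    # Hypothetical /wiki/ vs /dynwiki/ split: same rendering code, different
    # Cache-Control, with redirects when a reader lands on the "wrong" class.
    def application(environ, start_response):
        path = environ.get("PATH_INFO", "")
        cookies = environ.get("HTTP_COOKIE", "")
        # "Dynamic" readers: logged-in users, or anons with new-message flags.
        is_dynamic_reader = "session" in cookies      # hypothetical test

        if path.startswith("/wiki/") and is_dynamic_reader:
            # Static URL but dynamic reader: bounce to the uncacheable variant.
            start_response("302 Found",
                           [("Location", "/dynwiki/" + path[len("/wiki/"):])])
            return [b""]
        if path.startswith("/dynwiki/") and not is_dynamic_reader:
            # Dynamic URL but ordinary reader: bounce back to the cacheable one.
            start_response("302 Found",
                           [("Location", "/wiki/" + path[len("/dynwiki/"):])])
            return [b""]

        body = render_page(path)          # same rendering code either way
        cache_control = ("private, must-revalidate"
                         if path.startswith("/dynwiki/")
                         else "public, must-revalidate")
        start_response("200 OK",
                       [("Content-Type", "text/html; charset=utf-8"),
                        ("Cache-Control", cache_control)])
        return [body]

    def render_page(path):
        # Placeholder for the existing page-rendering code.
        return ("<html><body>" + path + "</body></html>").encode("utf-8")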
This would also have the effect of making third-world cached sites behind thin pipes far more efficient.
It's late here, and I'm tired, and this seems too good to be true, so it probably isn't. I'll think about it again in the morning.
-- Neil