Tim Starling wrote:
Neil Harris wrote:
The advantage of the streaming approach is (as I understand it)
- to eliminate the need for HEAD requests
No, intermediate caches can't be relied on to do that kind of revalidation, only browsers can. Wikipedia sends "Cache-Control: private,must-revalidate" which disables intermediate caches entirely. The point of the streaming approach is to allow an intermediate cache at all, that's why it was developed concurrently with our initial deployment of squid.
Maybe implementing the kind of revalidation you're talking about would be a useful step.
Ah. There's more to this than I had realised -- but I can see that it makes sense, since Wikipedia may serve up different content to different users for the same URL.
How about _eliminating_ this behaviour by having two possible URLs for each page, "/wiki/X" and "/dynwiki/X", both serving exactly the same content as a /wiki/ URL serves now?
/wiki/ URLs would be for general readers, marked as "public, must-revalidate", and would serve the same content to every user.
/dynwiki/ URLs would be for readers who may receive content that varies from the normal appearance sent to anons (anons with messages, and all logged-in users), and would be marked as "private, must-revalidate".
Both classes of URL would be rewritten to exactly the same internal URLs, and call the same code, as at present: the difference is that /dynwiki/ pages would be non-cacheable versions of the same content. Effectively, the difference between the two URLs is only a hint to any caches along the way as to whether the page is cacheable.
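Roughly, what I have in mind is something like this (a Python-flavoured sketch rather than MediaWiki's real PHP; the /w/index.php rewrite target and the names are just placeholders for illustration):

    # Sketch only: both URL classes map onto the same internal URL; the only
    # difference is the Cache-Control value handed to any caches along the way.
    CACHE_HEADERS = {
        "/wiki/":    "public, must-revalidate",   # cacheable by intermediate caches
        "/dynwiki/": "private, must-revalidate",  # browser-only caching
    }

    def route(request_path):
        """Return (internal_url, cache_control) for an article request path."""
        for prefix, cache_control in CACHE_HEADERS.items():
            if request_path.startswith(prefix):
                title = request_path[len(prefix):]
                # Both classes rewrite to exactly the same internal URL,
                # so exactly the same code renders the page either way.
                return "/w/index.php?title=" + title, cache_control
        raise ValueError("not an article URL: " + request_path)

    # route("/wiki/Foo")    -> ("/w/index.php?title=Foo", "public, must-revalidate")
    # route("/dynwiki/Foo") -> ("/w/index.php?title=Foo", "private, must-revalidate")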
When we get the conditional GET which every hit will generate, we can work out which page to serve based on the message flag for anons, and the presence of user cookies for logged-in users. If you access a /wiki/ page but should be getting /dynwiki/ content, you would be redirected to the corresponding /dynwiki/ URL, and similarly if you are accessing dynamic content but should be getting the static content. All of the links on a page would belong to the same base URL as the transmitted page, so the dynamic state would be "sticky", and there would not need to be many redirects: generally, only one for each change of state from dynamic to static or vice versa for a given user.
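In code, the redirect rule could be as simple as the following sketch (Python again, with a made-up Request object, a made-up "session" cookie name, and a made-up new-messages flag standing in for the real checks):

    from dataclasses import dataclass, field

    @dataclass
    class Request:
        """Minimal stand-in for an HTTP request object."""
        path: str
        cookies: dict = field(default_factory=dict)
        anon_has_new_messages: bool = False   # the "you have new messages" flag

    def wants_dynamic(req):
        """/dynwiki/ content goes to logged-in users and to anons with messages."""
        return "session" in req.cookies or req.anon_has_new_messages

    def maybe_redirect(req):
        """Return the URL to redirect to if the user is on the wrong URL class, else None."""
        on_dynamic = req.path.startswith("/dynwiki/")
        if wants_dynamic(req) and not on_dynamic:
            return "/dynwiki/" + req.path[len("/wiki/"):]
        if on_dynamic and not wants_dynamic(req):
            return "/wiki/" + req.path[len("/dynwiki/"):]
        return None   # already on the right class: serve in place, state stays sticky

    # A logged-in user hitting the static URL is bounced once:
    #   maybe_redirect(Request("/wiki/Foo", cookies={"session": "x"})) == "/dynwiki/Foo"
    # An ordinary anon on /wiki/ is never redirected:
    #   maybe_redirect(Request("/wiki/Foo")) is None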
Web crawlers, and the rest of the world, would generally see only the /wiki/ URLs and content. Only logged-in users and anons with messages would see the /dynwiki/ content.
If this works as I imagine, it would have the effect of rendering the entire Wikipedia cacheable for the (I'd guess) 90%+ of readers who are not logged in. Conditional GETs would still be needed for every page, but the bulk of the data would not need to be shifted whenever there is a hit, which could substantially reduce the average number of bytes shifted per page hit.
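To show what I mean about the bytes: on a revalidation the server only has to answer the conditional GET with "304 Not Modified" and no body when the page is untouched. A minimal sketch of that check (Python again; the page-touched timestamp would really come from the database, and the rendering is faked):

    from email.utils import format_datetime, parsedate_to_datetime

    def respond(page_last_touched, if_modified_since=None):
        """Answer a (possibly conditional) GET.

        page_last_touched must be a UTC-aware datetime; if_modified_since is
        the raw If-Modified-Since header from the browser or cache, if any.
        """
        headers = {"Last-Modified": format_datetime(page_last_touched, usegmt=True)}
        if if_modified_since is not None:
            if page_last_touched <= parsedate_to_datetime(if_modified_since):
                # Unchanged: send headers only, the cache reuses its stored copy,
                # and the bulk of the page body never crosses the wire.
                return 304, headers, b""
        body = b"<html>...rendered page...</html>"   # stand-in for the real render
        return 200, headers, body

    # from datetime import datetime, timezone
    # respond(datetime(2005, 1, 1, tzinfo=timezone.utc),
    #         "Sat, 01 Jan 2005 00:00:00 GMT")   ->   (304, {...}, b"")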
This would also have the effect of making third-world cached sites behind thin pipes far more efficient.
It's late here, and I'm tired, and this seems too good to be true, so it probably isn't. I'll think about it again in the morning.
-- Neil