The problem is due to recent changes to how mobile caching works. I just flushed the cache on all of the frontend varnish instances, which appears to have fixed the problem for now, but the underlying issue is still there. Note that the frontend instances have just 1GB of cache, so only very popular objects (like the enwiki front page) avoid getting LRU'd. The backend varnish instances use the SSDs and do the heavy caching work.
When I originally built this, I had the frontends force a short (300s) TTL on all cacheable objects, while the backends honored the times specified by MediaWiki.
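As a rough illustration of that two-tier TTL policy, here is a minimal Python sketch. The function name and tier labels are hypothetical, and it assumes the frontend override acts as a cap (an object with a shorter origin TTL keeps it); the real varnish configuration may have set 300s unconditionally.

```python
FRONTEND_TTL = 300  # seconds; the short frontend TTL described above


def effective_ttl(tier, origin_ttl):
    """Return the TTL a cache tier would apply to a cacheable object.

    tier is "frontend" or "backend" (hypothetical labels).
    origin_ttl is the TTL requested by the origin, in seconds.
    """
    if tier == "frontend":
        # Assumption: the override is modeled as a cap, not a fixed value.
        return min(origin_ttl, FRONTEND_TTL)
    # Backends honor whatever the origin specified.
    return origin_ttl


one_day = 24 * 60 * 60
frontend_ttl = effective_ttl("frontend", one_day)
backend_ttl = effective_ttl("backend", one_day)
```

Under this model, a purged backend object is stale on a frontend for at most 300 seconds.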
I chose to send purges only to the backend instances (via Wikia's old varnishhtcpd) and let the frontend instances catch up via their short TTLs. My reasoning was:
1) Our multicast purge stream is very busy and isn't split up by cache type, so it includes lots of purge requests for images on upload.wikimedia.org. Processing the purges is somewhat CPU-intensive, and I saw doing so once per varnish server as preferable to twice.
2) Purges are for URLs such as "en.wikipedia.org/wiki/Main_Page". The frontend varnish instance strips the m subdomain before sending the request onwards, but still caches content based on the request URL. Purges are never sent for "en.m.wikipedia.org/wiki/Main_Page", so every purge would need to be rewritten to apply to the frontend varnishes. Doing this blindly would be more expensive than it should be, since a significant percentage of purge statements aren't applicable.
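The mismatch in point 2 can be modeled with a small Python sketch. This is a toy model, not varnish itself: the dictionary cache and the helper names are hypothetical, and it assumes the frontend's cache key is simply the request host plus path.

```python
# Toy model of a frontend cache keyed by the request URL (host + path).
frontend_cache = {}


def store(host, path, body):
    """Cache a response under the URL the client actually requested."""
    frontend_cache[host + path] = body


def purge(url):
    """Drop url from the cache; return True only if an object was removed."""
    return frontend_cache.pop(url, None) is not None


# A mobile client requests the main page, so the object is stored
# under the m-subdomain URL.
store("en.m.wikipedia.org", "/wiki/Main_Page", "stale mobile main page")

# The purge stream only carries the desktop URL, which matches nothing.
desktop_purge_hit = purge("en.wikipedia.org/wiki/Main_Page")

# A purge for the mobile URL would match, but it is never sent.
mobile_purge_hit = purge("en.m.wikipedia.org/wiki/Main_Page")
```

Here `desktop_purge_hit` ends up `False` and `mobile_purge_hit` ends up `True`, which is why an unrewritten purge does nothing on a frontend.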
I don't think my original approach had any fans. Purges are now sent to both varnish instances per host, and more recently, the 300s TTL override was removed from the frontends. But the purges sent to the frontends are all no-ops, since they never match the m-subdomain URLs the frontends cache under.
There are multiple ways to make the purges sent to the frontends actually work, such as rewriting the purges in varnish, rewriting them before they're sent to varnish depending on where they're being sent, or perhaps changing how cached objects are stored in the frontend. I personally think it's all an unnecessary waste of resources and prefer my original approach.
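For the second option (rewriting purges before they reach a frontend), a minimal Python sketch might look like the following. The function name, the `MOBILE_PROJECTS` set, and the rule of inserting an "m" label after the language label are all assumptions for illustration; none of this is existing varnishhtcpd behavior.

```python
# Assumption: only these project domains have m-subdomain variants.
MOBILE_PROJECTS = {"wikipedia.org"}


def rewrite_purge_for_frontend(url):
    """Map a desktop purge URL to its mobile form, or None if not applicable.

    url is in the form used above, e.g. "en.wikipedia.org/wiki/Main_Page".
    """
    host, sep, path = url.partition("/")
    labels = host.split(".")
    if len(labels) != 3:
        # Already a mobile host (four labels) or something unexpected.
        return None
    lang = labels[0]
    project = ".".join(labels[1:])
    if project not in MOBILE_PROJECTS:
        # e.g. image purges for upload.wikimedia.org are skipped cheaply.
        return None
    return f"{lang}.m.{project}{sep}{path}"


rewritten = rewrite_purge_for_frontend("en.wikipedia.org/wiki/Main_Page")
skipped = rewrite_purge_for_frontend("upload.wikimedia.org/some/image.png")
```

Filtering before rewriting addresses the cost concern in point 2: purges that cannot apply to a frontend are dropped without any further work.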
-Asher
On Fri, May 3, 2013 at 2:23 PM, Arthur Richards arichards@wikimedia.org wrote:
+wikitech-l
I've confirmed the issue on my end; ?action=purge seems to have no effect, and the 'last modified' notification on the mobile main page looks correct (though the content itself is out of date and not in sync with the 'last modified' notification). What's doubly weird to me is that the 'Last-Modified' HTTP response header says:
Last-Modified: Tue, 30 Apr 2013 00:17:32 GMT
Which appears to be newer than when the content I'm seeing on the main page was updated... Anyone from ops have an idea what might be going on?
On Thu, May 2, 2013 at 10:01 PM, Yuvi Panda yuvipanda@gmail.com wrote:
Encountered
https://en.wikipedia.org/wiki/Wikipedia:Village_pump_(technical)#Issue_with_...
Some people seem to be having problems with the mobile main page being cached too much. Can someone look into it?
-- Yuvi Panda T http://yuvi.in/blog
Mobile-l mailing list Mobile-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/mobile-l
-- Arthur Richards Software Engineer, Mobile [[User:Awjrichards]] IRC: awjr +1-415-839-6885 x6687