Encountered https://en.wikipedia.org/wiki/Wikipedia:Village_pump_(technical)#Issue_with_...
Some people seem to be having problems with the mobile main page being cached too much. Can someone look into it?
-- Yuvi Panda T http://yuvi.in/blog
+wikitech-l
I've confirmed the issue on my end; ?action=purge seems to have no effect, and the 'last modified' notification on the mobile main page looks correct (though the content itself is out of date and not in sync with that notification). What's doubly weird to me is that the Last-Modified HTTP response header says:
Last-Modified: Tue, 30 Apr 2013 00:17:32 GMT
Which appears to be newer than when the content I'm seeing on the main page was updated... Anyone from ops have an idea what might be going on?
On Thu, May 2, 2013 at 10:01 PM, Yuvi Panda yuvipanda@gmail.com wrote:
Mobile-l mailing list Mobile-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/mobile-l
-- Arthur Richards Software Engineer, Mobile [[User:Awjrichards]] IRC: awjr +1-415-839-6885 x6687
Works for me in the UK... I'm logged out and everything...
On Fri, May 3, 2013 at 6:01 AM, Yuvi Panda yuvipanda@gmail.com wrote:
Actually, it appears fine for me now too - did someone do something to make that happen? Anyone else still seeing old content?
On Fri, May 3, 2013 at 3:00 PM, Jon Robson jdlrobson@gmail.com wrote:
Works for me in the UK... I'm logged out and everything...
That said, being logged out is not necessarily enough - the presence of certain cookies (even as a logged-out user) will cause you to bypass the cache. Also, the copy of the page cached in the bucket that matches your browser's cache-variance criteria is not necessarily the same as the copy in another bucket. Geography won't necessarily have an impact (though Accept-Encoding might).
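The cache-variance point above can be illustrated with a small sketch (hypothetical Python, not anything running in production): a cache that, like Varnish, keys objects on the URL plus whatever request headers the response varies on, so two logged-out users whose requests differ only in Accept-Encoding can be served different copies of the same page.

```python
class VaryingCache:
    """Toy cache keyed on URL plus the request headers named in a Vary list."""

    def __init__(self, vary_headers):
        self.vary_headers = vary_headers  # e.g. ["Accept-Encoding", "Cookie"]
        self.store = {}

    def _key(self, url, request_headers):
        # One "bucket" per distinct combination of varied header values.
        varied = tuple(request_headers.get(h, "") for h in self.vary_headers)
        return (url, varied)

    def get(self, url, request_headers):
        return self.store.get(self._key(url, request_headers))

    def put(self, url, request_headers, body):
        self.store[self._key(url, request_headers)] = body


cache = VaryingCache(["Accept-Encoding"])
cache.put("/wiki/Main_Page", {"Accept-Encoding": "gzip"}, "old copy")
cache.put("/wiki/Main_Page", {"Accept-Encoding": "identity"}, "new copy")

# The same URL now has two independent cached copies:
print(cache.get("/wiki/Main_Page", {"Accept-Encoding": "gzip"}))      # old copy
print(cache.get("/wiki/Main_Page", {"Accept-Encoding": "identity"}))  # new copy
```

This is why "works for me while logged out" doesn't prove much: the reporter and the tester may simply be hitting different buckets.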
The problem is due to recent changes to how mobile caching works. I just flushed the cache on all of the frontend varnish instances, which indeed appears to have fixed the problem, but it isn't actually fixed. Note that the frontend instances have just 1GB of cache, so only very popular objects (like the enwiki front page) avoid getting LRU'd. The backend varnish instances use the SSDs and do the heavy caching work.
When I originally built this, I had the frontends force a short (300s) ttl on all cacheable objects, while the backends honored the times specified by mediawiki.
I chose to only send purges to the backend instances (via wikia's old varnishhtcpd) and let the frontend instances catch up with their short ttls. My reasoning was:
1) Our multicast purge stream is very busy and isn't split up by cache type, so it includes lots of purge requests for images on upload.wikimedia.org. Processing the purges is somewhat cpu intensive, and I saw doing so once per varnish server as preferable to twice.
2) Purges are for URLs such as "en.wikipedia.org/wiki/Main_Page". The frontend varnish instance strips the m subdomain before sending the request onwards, but still caches content based on the request URL. Purges are never sent for "en.m.wikipedia.org/wiki/Main_Page" - every purge would need to be rewritten to apply to the frontend varnishes. Doing this blindly would be more expensive than it should be, since a significant percentage of purge statements aren't applicable.
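As a rough illustration of point 2 (a hypothetical sketch only - the hostname pattern and function name are assumptions, not the actual rewriting code), a purge for the canonical desktop URL would have to be mapped to the mobile hostname the frontend caches under, while purges for non-applicable hosts (e.g. images on upload.wikimedia.org) would be dropped:

```python
import re

# Assumed pattern for hosts that have a mobile (.m.) variant.
MOBILE_HOST = re.compile(
    r"^(?P<lang>[a-z\-]+)\.(?P<project>wikipedia|wiktionary|wikinews)\.org$"
)


def rewrite_purge_for_mobile(url):
    """Return the .m. variant of a desktop purge URL, or None if the
    purge does not apply to the mobile frontend caches."""
    host, _, path = url.partition("/")
    m = MOBILE_HOST.match(host)
    if not m:
        return None  # e.g. upload.wikimedia.org purges: not applicable
    return "%s.m.%s.org/%s" % (m.group("lang"), m.group("project"), path)


print(rewrite_purge_for_mobile("en.wikipedia.org/wiki/Main_Page"))
# en.m.wikipedia.org/wiki/Main_Page
print(rewrite_purge_for_mobile("upload.wikimedia.org/foo.jpg"))
# None
```

The cost Asher mentions is visible even in the toy version: every purge in the busy multicast stream pays for a regex match, and many of them are discarded.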
I don't think my original approach had any fans. Purges are now sent to both varnish instances per host, and more recently, the 300s ttl override was removed from the frontends. But all of the purges are no-ops.
There are multiple ways to make the purges sent to the frontends actually work: rewriting the purges in varnish, rewriting them before they're sent to varnish (depending on where they're being sent), or perhaps changing how cached objects are stored in the frontend. I personally think it's all an unnecessary waste of resources and prefer my original approach.
-Asher
On Fri, May 3, 2013 at 2:23 PM, Arthur Richards arichards@wikimedia.org wrote:
On Fri, 2013-05-03 at 10:31 +0530, Yuvi Panda wrote:
Also reported as https://bugzilla.wikimedia.org/show_bug.cgi?id=48062
andre
Hello All,
Let me introduce myself to the list first: my name is Matt, and I'm pretty experienced with HTTP and content relays - especially headers and things like this. Great to be participating in such a massive realtime collaboration over email in an ops setting! Neat. I would like to get more involved and, if the foundation accepts and can utilize my offer, to provide some IP blocks, dedicated nodes, haproxy layers, and storage if needed. I will post a separate thread for this.
Anyway, regarding this issue:
I don't think my original approach had any fans.
What was your original approach here? Do you have a diagram or any process flow that you could guide me through, or could you record a screenshot or YouTube video? :)
I always felt that Varnish was bloated, but I haven't used it for about 2.5 years - since around the time it first got noticed in the dev community. Is there a UI for Varnish now?
Purges are now sent to both varnish instances per host,
What is a purge? Over what protocol, or from which application? Or is it just a static 300-second clearing of the cache memory for this instance's URI?
and more recently, the 300s ttl override was removed from the frontends.
Not sure what that means, but this mix-up is probably why headers, last-modified times, and HTTP outputs aren't matching/syncing up right. It's going through too many layers, I feel.
But all of the purges are no-ops.
Right, because they are not clearing correctly or in the right order. It's probably more of a kernel stack issue and/or a switch/TCP transport issue. Use ettercap or tcpdump -i [interface] -vv and figure out what's happening more granularly?
There are multiple ways to approach making the purges sent to the frontends actually work
I still would like to understand what sending a purge to 'the frontend' does - and what a purge is, and what/where/why the frontend is receiving a command statement like this. It sounds like you're talking about BGP/ARP stuff, but you're not, right? (Or, at minimum, DNS.) You simply mean the cache-layer functions, I'm assuming?
such as rewriting the purges in varnish,
Don't do that in Varnish? Whatever that is, it's already having problems at the moment. Test the issue and its recurrence over time as quickly as possible, perhaps by:
rewriting them before they're sent to varnish
Bridge a Squid and ICAP layer right in front of the Varnish if you want to do that - the Squid is just there to transport the ICAP layer. You can do REQMOD and RESPMOD (request and response modification) very fast, with low latency, and very nicely.
depending on where they're being sent, or perhaps changing how cached objects are stored in the frontend.
What layer are cached objects stored in? Is it something like memcached or just a simple HTTP cache?
I personally think it's all an unnecessary waste of resources and prefer my original approach.
Hahaha! Wow, I just read the above line. What is your original approach, Asher?
Can someone clarify for me the bug and intent here and guide me through a demo of it?
I'll help out right away.
---- +Matt Kaufman [ops@mi2.com] [matt@mi2.com] 703-677-8901, 202-407-7998 | skype: mi2com | gchat: mkfmncom@gmail.com
On Fri, May 3, 2013 at 2:23 PM, Arthur Richards arichards@wikimedia.org wrote:
Which appears to be newer than when the content I'm seeing on the main page was updated... Anyone from ops have an idea what might be going on?
Yes, this is normal in the HTTP world. It doesn't really work reliably - that is what's going on all the time.
On Fri, May 03, 2013 at 03:19:13PM -0700, Asher Feldman wrote:
1) Our multicast purge stream is very busy and isn't split up by cache type, so it includes lots of purge requests for images on upload.wikimedia.org. Processing the purges is somewhat cpu intensive, and I saw doing so once per varnish server as preferable to twice.
I believe the plan is to split up the multicast groups *and* to filter based on predefined regexps on the HTCP->PURGE layer, via the varnishhtcpd rewrite. But I may be mistaken, Mark and Brandon will know more.
There are multiple ways to approach making the purges sent to the frontends actually work such as rewriting the purges in varnish, rewriting them before they're sent to varnish depending on where they're being sent, or perhaps changing how cached objects are stored in the frontend. I personally think it's all an unnecessary waste of resources and prefer my original approach.
Although the current VCL calls vcl_recv_purge after the rewrite step (and hence actually rewrites purges too), unless I'm mistaken this is actually unnecessary. The incoming purges match the way the objects are stored in the cache: both are without the .m. (et al.) prefix, as normal "desktop" purges are matched with objects that had their URLs rewritten in vcl_recv. Handling purges after the rewrite step might be unnecessary, but that doesn't mean it's a bad idea; it doesn't hurt much, and it's actually better, as it allows us to also purge via the original .m. URL, which is what a person might do instinctively.
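The ordering described above can be sketched as a toy model (illustrative Python, not real VCL; strip_mobile stands in for the vcl_recv rewrite): objects are stored under the rewritten desktop URL, so a purge via the original .m. URL only matches if it is handled after that same rewrite.

```python
def strip_mobile(url):
    # What the vcl_recv rewrite does: en.m.wikipedia.org -> en.wikipedia.org
    return url.replace(".m.", ".", 1)


# Objects are cached under the rewritten (desktop) URL:
cache = {strip_mobile("en.m.wikipedia.org/wiki/Main_Page"): "cached page"}


def purge(url, after_rewrite):
    """Purge an object; the key depends on whether the purge is handled
    before or after the rewrite step. Returns True if something matched."""
    key = strip_mobile(url) if after_rewrite else url
    return cache.pop(key, None) is not None


# A purge via the original .m. URL misses if handled before the rewrite,
# and hits if handled after it:
print(purge("en.m.wikipedia.org/wiki/Main_Page", after_rewrite=False))  # False
print(purge("en.m.wikipedia.org/wiki/Main_Page", after_rewrite=True))   # True
```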
While mobile purges were indeed broken in the recent past in a similar way, as you guessed - by I77b88f[1] ("Restrict PURGE lookups to mobile domains") - they were fixed shortly after with I76e5c4[2], a full day before the frontend cache TTL was removed.
1: https://gerrit.wikimedia.org/r/#q,I77b88f3b4bb5ec84f70b2241cdd5dc496025e6fd,... 2: https://gerrit.wikimedia.org/r/#q,I76e5c4218c1dec06673aa5121010875031c1a1e2,...
What actually broke them again this time is I3d0280[3], which stripped absolute URIs before vcl_recv_purge, despite the latter having code that matches only against absolute URIs. This is my commit, so I'm responsible for this breakage, although in my defence I have an even score now for discovering the flaw last time around :)
I've pushed and merged I08f761[4] which moves rewrite_proxy_urls after vcl_recv_purge and should hopefully unbreak purging while also not reintroducing BZ #47807.
3: https://gerrit.wikimedia.org/r/#q,I3d02804170f7e502300329740cba9f45437a24fa,... 4: https://gerrit.wikimedia.org/r/#q,I08f7615230037a6ffe7d1130a2a6de7ba370faf2,...
As a side note, notice how rewrite_proxy_urls & vcl_recv_purge are both flawed in the same way: the former exists solely to work around a Varnish bug with absolute URIs, while the latter *depends* on that bug manifesting in order to actually work. req.url should always be a (relative) URL, and hence the if (req.url ~ '^http:') comparison in vcl_recv_purge should normally always evaluate to false, making the whole function a no-op.
However, due to the bug in question, Varnish doesn't special-handle absolute URIs in violation of RFC 2616. This, in combination with the fact that varnishhtcpd always sends absolute URIs (due to an RFC-compliant behavior of LWP's proxy() method), is why we have this seemingly wrong VCL code but which actually works as intended.
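The interplay above can be sketched as a toy model (Python standing in for VCL; the function names mirror the VCL routines, but the bodies are assumptions for illustration): vcl_recv_purge only matches absolute URIs, as varnishhtcpd sends them, so running the absolute-URI-stripping rewrite first turns every purge into a silent no-op - the I3d0280 breakage that I08f761 fixes by reordering the two steps.

```python
import re


def recv_purge(req_url, cache):
    """Toy vcl_recv_purge: only acts on absolute URIs (http://host/path).
    Returns True if an object was purged."""
    m = re.match(r"^http://(?P<host>[^/]+)(?P<path>/.*)$", req_url)
    if not m:
        return False  # relative URL: the whole function is a no-op
    return cache.pop(m.group("host") + m.group("path"), None) is not None


def rewrite_proxy_urls(req_url):
    # Toy workaround for the Varnish absolute-URI bug: reduce to a relative URL.
    return re.sub(r"^http://[^/]+", "", req_url)


cache = {"en.wikipedia.org/wiki/Main_Page": "stale page"}
purge_url = "http://en.wikipedia.org/wiki/Main_Page"

# Rewrite before purge (the broken I3d0280 ordering): the purge never matches.
print(recv_purge(rewrite_proxy_urls(purge_url), dict(cache)))  # False
# Purge before rewrite (the I08f761 ordering): the purge works.
print(recv_purge(purge_url, cache))                            # True
```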
This Varnish bug was reported by Tim upstream[5] and the fix is currently sitting in Varnish's git master[6]. It's simple enough and it might be worth backporting, although it might be more trouble than it's worth, considering how it will break purges with our current VCL :)
5: https://www.varnish-cache.org/trac/ticket/1255 6: https://www.varnish-cache.org/trac/changeset/2bbb032bf67871d7d5a43a38104d58f...
Cheers, Faidon
Faidon - thanks for the more accurate trackdown, and fix!