Re: [Wikitech-l] [WikimediaMobile] Caching Problem with Mobile Main Page?


Arthur Richards

3 May 2013

11:23 p.m.

+wikitech-l

I've confirmed the issue on my end; ?action=purge seems to have no effect, and the 'last modified' notification on the mobile main page looks correct (though the content itself is out of date and not in sync with that notification). What's doubly weird to me is that the Last-Modified HTTP response header says:

Last-Modified: Tue, 30 Apr 2013 00:17:32 GMT

Which appears to be newer than when the content I'm seeing on the main page was updated... Anyone from ops have an idea what might be going on?

On Thu, May 2, 2013 at 10:01 PM, Yuvi Panda yuvipanda@gmail.com wrote:

...

Encountered

https://en.wikipedia.org/wiki/Wikipedia:Village_pump_(technical)#Issue_with_...

...

Some people seem to be having problems with the mobile main page being cached too much. Can someone look into it?

-- Yuvi Panda T http://yuvi.in/blog

Mobile-l mailing list Mobile-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/mobile-l

-- Arthur Richards Software Engineer, Mobile [[User:Awjrichards]] IRC: awjr +1-415-839-6885 x6687


Asher Feldman

4 May

12:19 a.m.

New subject: [WikimediaMobile] Caching Problem with Mobile Main Page?

The problem is due to recent changes that were made to how mobile caching works. I just flushed the cache on all of the frontend varnish instances, which does appear to have cleared the symptom, but the underlying problem isn't actually fixed. Note that the frontend instances have just 1GB of cache, so only very popular objects (like the enwiki front page) avoid getting LRU'd. The backend varnish instances utilize the SSDs and perform the heavy caching work.

When I originally built this, I had the frontends force a short (300s) ttl on all cacheable objects, while the backends honored the times specified by mediawiki.
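The two-tier TTL policy described above can be sketched roughly as follows. This is only an illustration: the function names are hypothetical, and modelling the frontend override as a cap (rather than an unconditional overwrite) is an assumption, since the original VCL is not shown in this thread.

```python
FRONTEND_TTL_CAP = 300  # seconds; the short frontend TTL mentioned above


def backend_ttl(origin_ttl: int) -> int:
    """Backend tier: honor the lifetime specified by MediaWiki."""
    return origin_ttl


def frontend_ttl(origin_ttl: int, cap: int = FRONTEND_TTL_CAP) -> int:
    """Frontend tier: keep a cacheable object no longer than the cap.

    Assumption: the override behaves like min(); the real VCL may have
    simply overwritten the TTL.
    """
    return min(origin_ttl, cap)


if __name__ == "__main__":
    one_month = 30 * 24 * 3600
    print("backend keeps for", backend_ttl(one_month), "s")
    print("frontend keeps for", frontend_ttl(one_month), "s")
```

Under this model, an unpurged frontend can serve stale content for at most the cap, which is why purging only the backends was tolerable while the 300s override was in place.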

I chose to only send purges to the backend instances (via wikia's old varnishhtcpd) and let the frontend instances catch up with their short ttls. My reasoning was:

1) Our multicast purge stream is very busy and isn't split up by cache type, so it includes lots of purge requests for images on upload.wikimedia.org. Processing the purges is somewhat cpu intensive, and I saw doing so once per varnish server as preferable to twice.

2) Purges are for url's such as "en.wikipedia.org/wiki/Main_Page". The frontend varnish instance strips the m subdomain before sending the request onwards, but still caches content based on the request url. Purges are never sent for "en.m.wikipedia.org/wiki/Main_Page" - every purge would need to be rewritten to apply to the frontend varnishes. Doing this blindly would be more expensive than it should be, since a significant percentage of purge statements aren't applicable.
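The key mismatch described in point 2 can be illustrated with a toy cache. This is a sketch of the behaviour as this message describes it, not of the actual Varnish configuration; `FrontendCache` and `to_mobile` are hypothetical names introduced only for the example.

```python
from typing import Dict


class FrontendCache:
    """Toy cache keyed by the URL the client originally requested."""

    def __init__(self) -> None:
        self.objects: Dict[str, str] = {}

    def store(self, request_url: str, body: str) -> None:
        self.objects[request_url] = body

    def purge(self, purge_url: str) -> bool:
        """Remove the object stored under purge_url.

        Returns True if an object was removed, False if the purge was a no-op.
        """
        return self.objects.pop(purge_url, None) is not None


def to_mobile(url: str) -> str:
    """Hypothetical rewrite: insert the 'm' subdomain after the language code."""
    host, _, path = url.partition("/")
    lang, _, rest = host.partition(".")
    return f"{lang}.m.{rest}/{path}"


if __name__ == "__main__":
    cache = FrontendCache()
    cache.store("en.m.wikipedia.org/wiki/Main_Page", "stale main page")

    purge_url = "en.wikipedia.org/wiki/Main_Page"
    print("unrewritten purge hit:", cache.purge(purge_url))
    print("rewritten purge hit:", cache.purge(to_mobile(purge_url)))
```

In this model, a purge for the desktop URL never touches the object stored under the mobile URL unless the purge is rewritten first.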

I don't think my original approach had any fans. Purges are now sent to both varnish instances per host, and more recently, the 300s ttl override was removed from the frontends. But all of the purges are no-ops.

There are multiple ways to approach making the purges sent to the frontends actually work such as rewriting the purges in varnish, rewriting them before they're sent to varnish depending on where they're being sent, or perhaps changing how cached objects are stored in the frontend. I personally think it's all an unnecessary waste of resources and prefer my original approach.

-Asher


Faidon Liambotis

5 May

11:33 a.m.

New subject: [WikimediaMobile] Caching Problem with Mobile Main Page?

On Fri, May 03, 2013 at 03:19:13PM -0700, Asher Feldman wrote:

...

Our multicast purge stream is very busy and isn't split up by cache type, so it includes lots of purge requests for images on upload.wikimedia.org. Processing the purges is somewhat cpu intensive, and I saw doing so once per varnish server as preferable to twice.

I believe the plan is to split up the multicast groups *and* to filter based on predefined regexps on the HTCP->PURGE layer, via the varnishhtcpd rewrite. But I may be mistaken, Mark and Brandon will know more.
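Filtering purges by predefined regexps before they reach a given cache might look something like the sketch below. The patterns and the `filter_purges` helper are hypothetical; the actual regexps and the varnishhtcpd rewrite are not shown in this thread.

```python
import re
from typing import Iterable, List

# Hypothetical per-cache-type patterns; the real regexps are not given here.
PURGE_FILTERS = {
    "upload": re.compile(r"^https?://upload\.wikimedia\.org/"),
    "mobile": re.compile(r"^https?://[a-z-]+\.(m\.)?wikipedia\.org/"),
}


def filter_purges(urls: Iterable[str], cache_type: str) -> List[str]:
    """Keep only the purge URLs that are relevant to the given cache type."""
    pattern = PURGE_FILTERS[cache_type]
    return [u for u in urls if pattern.match(u)]


if __name__ == "__main__":
    stream = [
        "http://upload.wikimedia.org/wikipedia/commons/a/ab/Example.png",
        "http://en.wikipedia.org/wiki/Main_Page",
    ]
    print(filter_purges(stream, "mobile"))
```

Dropping irrelevant purges at this layer would mean each varnish instance only spends CPU on purges that could actually match one of its objects.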

...

There are multiple ways to approach making the purges sent to the frontends actually work such as rewriting the purges in varnish, rewriting them before they're sent to varnish depending on where they're being sent, or perhaps changing how cached objects are stored in the frontend. I personally think it's all an unnecessary waste of resources and prefer my original approach.

Although the current VCL calls vcl_recv_purge after the rewrite step (and hence actually rewrites purges too), unless I'm mistaken this is actually unnecessary. The incoming purges match the way the objects are stored in the cache: both are without the .m. (et al) prefix, as normal "desktop" purges are matched with objects that had their URLs rewritten in vcl_recv. Handling purges after the rewrite step might be unnecessary, but that doesn't mean it's a bad idea; it doesn't hurt much and it's better, as it allows us to also purge via the original .m. URL, which is what a person might do instinctively.

While mobile purges were actually broken recently, in a similar way to what you guessed, by I77b88f[1] ("Restrict PURGE lookups to mobile domains"), they were fixed shortly after with I76e5c4[2], a full day before the frontend cache TTL was removed.

1: https://gerrit.wikimedia.org/r/#q,I77b88f3b4bb5ec84f70b2241cdd5dc496025e6fd,...
2: https://gerrit.wikimedia.org/r/#q,I76e5c4218c1dec06673aa5121010875031c1a1e2,...

What actually broke them again this time is I3d0280[3], which stripped absolute URIs before vcl_recv_purge, despite the latter having code that matches only against absolute URIs. This is my commit, so I'm responsible for this breakage, although in my defence I have an even score now for discovering the flaw last time around :)

I've pushed and merged I08f761[4] which moves rewrite_proxy_urls after vcl_recv_purge and should hopefully unbreak purging while also not reintroducing BZ #47807.
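The ordering problem can be modelled with a toy simulation. This is only a sketch: the two functions below borrow the names rewrite_proxy_urls and vcl_recv_purge from the VCL, but their bodies are simplified assumptions based on the description in this message, not the real code.

```python
import re
from dataclasses import dataclass


@dataclass
class Request:
    method: str
    url: str
    host: str = ""
    purged: bool = False


def rewrite_proxy_urls(req: Request) -> None:
    """Strip scheme and host from an absolute URI, leaving a relative URL."""
    m = re.match(r"^https?://([^/]+)(/.*)?$", req.url)
    if m:
        req.host = m.group(1)
        req.url = m.group(2) or "/"


def vcl_recv_purge(req: Request) -> None:
    """Act only on PURGE requests whose URL is still absolute (assumed model)."""
    if req.method == "PURGE" and re.match(r"^http:", req.url):
        req.purged = True


def handle(req: Request, order) -> Request:
    """Run the given steps on the request in sequence."""
    for step in order:
        step(req)
    return req


if __name__ == "__main__":
    url = "http://en.wikipedia.org/wiki/Main_Page"
    broken = handle(Request("PURGE", url), [rewrite_proxy_urls, vcl_recv_purge])
    fixed = handle(Request("PURGE", url), [vcl_recv_purge, rewrite_proxy_urls])
    print("rewrite first, purge acted:", broken.purged)
    print("purge first, purge acted:", fixed.purged)
```

With the rewrite running first, the purge step never sees an absolute URI and silently does nothing; swapping the order lets the purge match before the URL is normalised.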

3: https://gerrit.wikimedia.org/r/#q,I3d02804170f7e502300329740cba9f45437a24fa,...
4: https://gerrit.wikimedia.org/r/#q,I08f7615230037a6ffe7d1130a2a6de7ba370faf2,...

As a side note, notice how rewrite_proxy_urls & vcl_recv_purge are both flawed in the same way: the former exists solely to work around a Varnish bug with absolute URIs, while the latter *depends* on that bug manifesting in order to work at all. req.url should always be a (relative) URL, and hence the if (req.url ~ '^http:') comparison in vcl_recv_purge should normally always evaluate to false, making the whole function a no-op.

However, due to the bug in question, Varnish doesn't special-handle absolute URIs, in violation of RFC 2616. This, in combination with the fact that varnishhtcpd always sends absolute URIs (due to an RFC-compliant behavior of LWP's proxy() method), is why we have this seemingly wrong VCL code which nevertheless works as intended.

This Varnish bug was reported by Tim upstream[5] and the fix is currently sitting in Varnish's git master[6]. It's simple enough and it might be worth backporting, although it might be more trouble than it's worth, considering how it will break purges with our current VCL :)

5: https://www.varnish-cache.org/trac/ticket/1255
6: https://www.varnish-cache.org/trac/changeset/2bbb032bf67871d7d5a43a38104d58f...

Cheers, Faidon

Asher Feldman

7:16 p.m.

New subject: [WikimediaMobile] Caching Problem with Mobile Main Page?

Faidon - thanks for the more accurate trackdown, and fix!

