suggests that most of the time is network delay between Varnish and the browser, so it is more useful to think about CDNs than about caching strategies.
(Also I wonder what the missing 20% is. Seems a bit high for DNS / TCP handshakes.)

I don't think that your query was the right way to look at Varnish misses pulling an image from Swift. Consider this one, which is my attempt to fix it:

SELECT
  sum(event_request)/sum(event_total) request,
  sum(event_response)/sum(event_total) response,
  sum(event_tcp)/sum(event_total) tcp,
  count(*) count
FROM MultimediaViewerNetworkPerformance_10774577
WHERE event_type = 'image'
  AND event_response IS NOT NULL
  AND event_request IS NOT NULL
  AND event_varnish1hits = 0
  AND event_varnish2hits = 0
  AND event_varnish3hits = 0
  AND event_timestamp - event_lastModified > 60
  AND event_total > 1000

request: 0.2031
response: 0.3213
tcp: 0.0132
count: 207

Very different results, and 48% of the time is unaccounted for. Going by the Resource Timing model, that time could be spent in the app cache or DNS. It seems strange, but without figures it's hard to tell. In any case, I don't think that requestStart is the right thing to look at to measure what's happening behind the scenes: the time at which the local browser emitted the request isn't the time the server received it, nor does the time at which the response starts being received prove anything about the server side of things. If Varnish reported its own time breakdown through custom headers, that would truly answer the question.
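To make the accounting concrete, here's a rough Python sketch of how the event_* fractions might map onto Resource Timing marks. The field mapping is my assumption for illustration, not the actual MultimediaViewerNetworkPerformance instrumentation; the point is that the measured phases need not sum to the total, so an unaccounted share can hide in the gaps:

```python
# Hypothetical mapping of event_* fields onto Resource Timing marks.
# This is an assumption for illustration -- check the real instrumentation.

def breakdown(t):
    """t: dict of Resource Timing marks in ms, as in a PerformanceResourceTiming entry."""
    total = t["responseEnd"] - t["fetchStart"]
    return {
        "dns": (t["domainLookupEnd"] - t["domainLookupStart"]) / total,
        "tcp": (t["connectEnd"] - t["connectStart"]) / total,
        "request": (t["responseStart"] - t["requestStart"]) / total,   # wait for first byte
        "response": (t["responseEnd"] - t["responseStart"]) / total,   # body transfer
    }

# Example entry with made-up timings: note the gaps before domainLookupStart
# (app cache lookup) and between phases, which no event_* field captures.
entry = {
    "fetchStart": 0, "domainLookupStart": 120, "domainLookupEnd": 140,
    "connectStart": 140, "connectEnd": 170, "requestStart": 175,
    "responseStart": 420, "responseEnd": 700,
}
shares = breakdown(entry)
unaccounted = 1 - sum(shares.values())  # the share no measured phase explains
```

With these made-up numbers the measured phases cover only about 82% of the total, leaving ~18% unaccounted, which is the same shape of discrepancy as in the query results above.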

What is the worst case negative impact of this test?

The benefit of prerendering was too small to be measured when it was deployed, and my findings hint at why: only 0.085% of images served in media viewer should be affected by prerendering (Varnish misses that don't find a pre-existing image stored in Swift). I'm turning it off to confirm that basically nothing happens and media viewer performance doesn't change. It's as risk-free as it gets, because it's only seeking confirmation, but there could always be a surprise.

Maybe images uploaded before a certain time, or similar.

This information has been recorded as well, and it's what let me confirm that turning on pre-rendering had no measurable effect, because I was able to compare images uploaded before and after prerendering was active. It did lead me to find https://phabricator.wikimedia.org/T84980, but that seems to be caused by the way Varnish works. It could still be part of the explanation, though. If performance can be as much as 36% worse for images freshly uploaded and added to Varnish, maybe the same phenomenon happens again when images are re-added to Varnish from Swift (we can't measure that with the current data, because we'd need to track these stats per-image). And while at first glance it looks like a smaller effect (only 36% slower at worst, compared to 100% slower at worst for the "swift pull" phenomenon), if the same issue does happen on Varnish re-adds, then the average figures for non-fresh uploads might include suboptimal requests affected by the same issue, making the two figures, currently separated by 36%, closer than they really are.
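A toy calculation with entirely made-up numbers shows the averaging effect I mean: if some fraction of "non-fresh" requests secretly hit a re-warming object, the non-fresh average is inflated, and the measured fresh-vs-non-fresh gap shrinks relative to the true one:

```python
# Toy numbers -- purely illustrative, not measured values.

warm_ms = 100.0    # hypothetical response time for a truly warm Varnish object
rewarm_ms = 200.0  # hypothetical time for an object being re-added from Swift (100% slower)
fresh_ms = 136.0   # fresh uploads, 36% slower than the warm baseline

# Suppose 20% of "non-fresh" requests actually hit a re-warming object:
non_fresh_avg = 0.8 * warm_ms + 0.2 * rewarm_ms  # inflated average

measured_gap = fresh_ms / non_fresh_avg - 1  # gap against the inflated average
true_gap = fresh_ms / warm_ms - 1            # gap against the clean warm baseline
```

With these numbers the measured gap comes out around 13% even though the true gap is 36%, which is the sense in which the two figures could be closer than they really are.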

To try to describe the issue simply, the data suggests that Varnish needs warming up before it reaches its optimal performance. The more an image is accessed, the faster Varnish serves it. And when images fall out of Varnish because they haven't been accessed in a while, that warmup might need to happen again the next time they are, which could be one of the reasons why those "swift pulls" are so much slower. Things are probably made worse by the fact that, being infrequently accessed, those images are likely to fall out of Varnish again before the next time they're requested. And even if they don't completely fall out, their warmup period could be stretched over a much longer period of time than for frequently accessed images. Without per-image data, or Varnish reporting more through custom headers, that explanation is theoretical and might not be everything that's at stake.
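As a simplified illustration of the "falling out before the next request" part, here's a plain LRU cache model in Python. This is not Varnish's actual eviction or warmup behavior, just a sketch of why rarely requested objects can miss every time: intervening traffic evicts them between their own requests:

```python
from collections import OrderedDict

# Minimal LRU cache model -- an illustration only, not Varnish's real policy.

class LRU:
    def __init__(self, capacity):
        self.capacity = capacity
        self.store = OrderedDict()

    def get(self, key):
        if key in self.store:
            self.store.move_to_end(key)     # refresh recency on a hit
            return True                      # hit
        if len(self.store) >= self.capacity:
            self.store.popitem(last=False)   # evict least recently used
        self.store[key] = None               # miss: object pulled from the backend
        return False

cache = LRU(capacity=3)
hits = {"popular": 0, "rare": 0}
for i in range(100):
    hits["popular"] += cache.get("popular-image")   # requested every tick
    cache.get(f"filler-{i}")                         # churn from other traffic
    if i % 10 == 0:
        hits["rare"] += cache.get("rare-image")      # requested only occasionally
```

In this model the popular image misses once and then always hits, while the rare image never hits at all, because the filler traffic evicts it before its next request. A real cache with warmup effects on top would make the gap even wider.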

Federico, regarding geography and file types: the sample sizes I've been looking at for the main phenomena shouldn't be affected by that, as the size, type and geo mix over a large number of requests and a long period of time should be fairly consistent. Of course, what matters is having large sample sizes.

FWIW here's what Swift sees for time to first byte on HTTP GET requests where it returned 200 (99th percentile and average)

Is this the latency between Varnish and Swift? Or does it also include web servers/workers pulling originals from Swift?


On Fri, Jan 9, 2015 at 8:54 PM, Gergo Tisza <gtisza@wikimedia.org> wrote:

Thanks for digging into this, Gilles! It's great that we are becoming more data-driven on this.

IMO the next step should be differentiating between Varnish delay and network delay. event_response should tell us the time between first and last response byte; so Swift delay will be in event_request and network delay in event_response (or so I think, but that should be verified).

A quick first attempt:

mysql:research@analytics-store.eqiad.wmnet [log]> select sum(event_request)/sum(event_total) request, sum(event_response)/sum(event_total) response, count(*) count from MultimediaViewerNetworkPerformance_10774577 where event_type ='image' and event_response is not null and event_XCache not like '%hit%' and event_total > 5000 and date(timestamp(timestamp)) = current_date()\G
*************************** 1. row ***************************
request: 0.0800
response: 0.7390
count: 536

suggests that most of the time is network delay between Varnish and the browser, so it is more useful to think about CDNs than about caching strategies.
(Also I wonder what the missing 20% is. Seems a bit high for DNS / TCP handshakes.)