> suggests that most of the time is network delay between Varnish and the
> browser so it is more useful to think about CDNs than about caching
> strategies.
> (Also I wonder what the missing 20% is. Seems a bit high for DNS / TCP
> handshakes.)
I don't think that your query was the right way to look at Varnish misses
pulling an image from Swift. Consider this one instead, which is my attempt
to fix it:
SELECT sum(event_request)/sum(event_total) request,
       sum(event_response)/sum(event_total) response,
       sum(event_tcp)/sum(event_total) tcp,
       count(*) count
FROM MultimediaViewerNetworkPerformance_10774577
WHERE event_type = 'image'
  AND event_response IS NOT NULL
  AND event_request IS NOT NULL
  AND event_varnish1hits = 0
  AND event_varnish2hits = 0
  AND event_varnish3hits = 0
  AND event_timestamp - event_lastModified > 60
  AND event_total > 1000
request: 0.2031
response: 0.3213
tcp: 0.0132
count: 207
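(As a quick sanity check on those fractions, here's the arithmetic from the
query output above, in Python:)

```python
# Measured shares of event_total from the query output above.
request = 0.2031   # event_request share
response = 0.3213  # event_response share
tcp = 0.0132       # event_tcp share

# Whatever portion of event_total these three phases don't cover.
unaccounted = 1 - (request + response + tcp)
print(f"unaccounted: {unaccounted:.2%}")  # roughly 46%
```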
Very different results, and roughly 46% of the time is unaccounted for.
Considering the Resource Timing model, that time could be spent in the app
cache or DNS. It seems strange, but without figures it's hard to tell. In
any case, I don't think that requestStart is the right thing to look at to
measure what's happening behind the scenes: the time the local browser
emitted the request isn't the time the server received it, nor is the time
the response starts being received proof of anything about the server side
of things. If Varnish reported its own time breakdown through custom
headers, that would truly answer the question.
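To make that concrete, here's a rough sketch of a full Resource Timing phase
breakdown in Python. The timestamps are invented for illustration; only the
attribute names come from the Resource Timing model:

```python
# Illustrative only: timestamps (ms) in the order the Resource Timing model
# defines them; the values are made up for the example.
timing = {
    "startTime": 0.0,
    "domainLookupStart": 5.0, "domainLookupEnd": 40.0,  # DNS lookup
    "connectStart": 40.0, "connectEnd": 70.0,           # TCP handshake
    "requestStart": 70.0,                               # request emitted
    "responseStart": 250.0,                             # first response byte
    "responseEnd": 600.0,                               # last response byte
}

phases = {
    "dns": timing["domainLookupEnd"] - timing["domainLookupStart"],
    "tcp": timing["connectEnd"] - timing["connectStart"],
    # requestStart -> responseStart mixes network transit *and* server-side
    # time; without server-reported headers the two can't be separated.
    "request": timing["responseStart"] - timing["requestStart"],
    "response": timing["responseEnd"] - timing["responseStart"],
}
total = timing["responseEnd"] - timing["startTime"]
for name, ms in phases.items():
    print(f"{name}: {ms / total:.2%}")
```

Note how "request" is the only phase that even touches the server, and it is
still dominated by round-trip transit from the browser's point of view.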
> What is the worst case negative impact of this test?
The benefit of prerendering was too small to be measured when it was
deployed, and my findings hint at why: only 0.085% of images served in
media viewer should be affected by prerendering (varnish misses that don't
find a pre-existing image stored in Swift). I'm turning it off to confirm
the fact that basically nothing should happen and media viewer performance
shouldn't change. It's as risk-free as it gets because it's only seeking
confirmation, but there could always be a surprise.
> Maybe images uploaded before a certain time, or similar.
This information has been recorded as well, and it's what let me confirm
that turning on pre-rendering had no measurable effect, because I was able
to look at images uploaded before and after prerendering was active. It did
lead me to find out about
https://phabricator.wikimedia.org/T84980 but that
seems to be caused by the way Varnish works. It could be part of the
explanation, though. If performance can be as much as 36% worse for images
freshly uploaded and added to Varnish, maybe the same phenomenon (which we
can't measure with the current data, because we'd need to track these stats
per-image) happens again when they're re-added to Varnish from Swift. And
while at first glance it looks like a smaller effect (only 36% slower at
worst, compared to 100% slower at worst with the "swift pull" phenomenon),
if the same issue does happen on Varnish re-adds, then the average figures
for non-fresh uploads might include suboptimal requests affected by the
same issue, making the two figures separated by 36% closer than they really
are if this doesn't affect only fresh uploads.
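As a purely hypothetical illustration of that averaging effect (none of
these numbers are measurements, and the 20% affected share is invented):

```python
# Hypothetical numbers to illustrate the averaging effect, not measurements.
true_fast = 100.0     # ms for a fully warmed-up request
penalty = 1.36        # fresh uploads measured ~36% slower
affected_share = 0.2  # suppose 20% of "non-fresh" requests also pay it

# If some baseline requests pay the same penalty, the baseline average
# creeps up, and the measured gap shrinks below the true gap.
baseline = (1 - affected_share) * true_fast \
         + affected_share * true_fast * penalty
fresh = true_fast * penalty

measured_gap = fresh / baseline - 1
true_gap = penalty - 1
print(f"measured gap: {measured_gap:.1%}, true gap: {true_gap:.1%}")
```

With these made-up inputs the measured gap comes out around 27% even though
the true per-request penalty is 36%.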
To try to describe the issue simply: the data suggests that Varnish needs
warming up before reaching its optimal performance. The more an image is
accessed, the faster it gets served by Varnish. And when images fall out of
Varnish because they haven't been accessed in a while, that warmup might
need to happen again the next time they are, which could be one of the
reasons why those "swift pulls" are so much slower. Things are probably
made worse by the fact that, being infrequently accessed, those images are
likely to fall out of Varnish again before the next time they're requested.
And even if they don't completely fall out, their warmup period could be
stretched over a much longer period of time than for frequently accessed
images. Without per-image data, or Varnish reporting more things through
custom headers, that explanation is theoretical and might not be everything
that's at stake.
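The eviction/warmup cycle I'm describing can be sketched with a toy LRU
cache. This is purely illustrative; Varnish's actual eviction logic is more
involved:

```python
from collections import OrderedDict

class ToyLRU:
    """Minimal LRU cache: the least recently used entry is evicted first."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()

    def get(self, key):
        if key in self.entries:
            self.entries.move_to_end(key)  # mark as recently used
            return True                    # hit: served warm
        # Miss: pull from the backing store (the "swift pull") and cache it.
        self.entries[key] = True
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict the coldest entry
        return False

cache = ToyLRU(capacity=3)
# A popular image stays warm; a rarely requested one is evicted in between.
for key in ["popular", "rare", "popular", "a", "b", "popular", "rare"]:
    print(key, "hit" if cache.get(key) else "miss")
# The final "rare" request misses again: it fell out of the cache before
# being re-requested, so it pays the full miss cost every time.
```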
Federico, regarding geography and file types, the sample sizes I've been
looking at for the main phenomena shouldn't be affected by that, as the
size, type and geo mix over a large number of requests and a long period of
time should be fairly consistent. Of course what matters is having large
sample sizes.
> FWIW here's what swift sees for time to first byte on HTTP GET in which it
> returned 200 (99% percentile and average)
Is this the latency between Varnish and Swift? Or does it also include web
servers/workers pulling originals from Swift?
On Fri, Jan 9, 2015 at 8:54 PM, Gergo Tisza <gtisza(a)wikimedia.org> wrote:
> Thanks for digging into this, Gilles!
> It's great that we are becoming
> more data-driven on this.
> IMO the next step should be differentiating
> between Varnish delay and
> network delay. event_response should tell us the time between first and
> last response byte; so Swift delay will be in event_request and network
> delay in event_response (or so I think, but that should be verified).
> A quick first attempt:
> mysql:research@analytics-store.eqiad.wmnet
> [log]> select
> sum(event_request)/sum(event_total) request,
> sum(event_response)/sum(event_total) response, count(*) count from
> MultimediaViewerNetworkPerformance_10774577 where event_type ='image' and
> event_response is not null and event_XCache not like '%hit%' and
> event_total > 5000 and date(timestamp(timestamp)) = current_date()\G
> *************************** 1. row ***************************
> request: 0.0800
> response: 0.7390
> count: 536
> suggests that most of the time is network delay between Varnish and the
> browser so it is more useful to think about CDNs than about caching
> strategies.
> (Also I wonder what the missing 20% is. Seems a bit high for DNS / TCP
> handshakes.)