On Mon, Jan 12, 2015 at 5:23 AM, Gilles Dubuc <gilles(a)wikimedia.org> wrote:
I don't think that your query was the right way to
look at Varnish misses
pulling an image from Swift, consider this one,
which is my attempt to fix
it:
SELECT sum(event_request)/sum(event_total) request,
sum(event_response)/sum(event_total) response,
sum(event_tcp)/sum(event_total) tcp, count(*) count FROM
MultimediaViewerNetworkPerformance_10774577 WHERE event_type ='image' AND
event_response IS NOT NULL AND event_request IS NOT NULL AND
event_varnish1hits = 0 AND event_varnish2hits = 0 AND event_varnish3hits =
0 AND event_timestamp - event_lastModified > 60 AND event_total > 1000
request: 0.2031
response: 0.3213
tcp: 0.0132
count: 207
Very different results and 48% of the time is unaccounted for.
The difference seems to be due to the event_varnish3hits = 0 condition:
SELECT sum(event_request)/sum(event_total) request,
sum(event_response)/sum(event_total) response,
sum(event_tcp)/sum(event_total) tcp, sum(event_dns)/sum(event_total) dns,
sum(event_redirect)/sum(event_total) redirect,
sum(event_cache)/sum(event_total) cache, avg(event_total) avg, count(*)
count FROM MultimediaViewerNetworkPerformance_10774577 WHERE event_type
='image' AND event_response IS NOT NULL AND event_request IS NOT NULL AND
event_varnish1hits = 0 AND event_varnish2hits = 0 AND *event_varnish3hits =
0* AND event_timestamp - event_lastModified > 60 AND event_total > 1000;
+---------+----------+--------+--------+----------+--------+------------+-------+
| request | response | tcp | dns | redirect | cache | avg |
count |
+---------+----------+--------+--------+----------+--------+------------+-------+
| 0.2080 | 0.3245 | 0.0131 | 0.0032 | 0.0000 | 0.1307 | 11897.6429 |
224 |
+---------+----------+--------+--------+----------+--------+------------+-------+
SELECT sum(event_request)/sum(event_total) request,
sum(event_response)/sum(event_total) response,
sum(event_tcp)/sum(event_total) tcp, sum(event_dns)/sum(event_total) dns,
sum(event_redirect)/sum(event_total) redirect,
sum(event_cache)/sum(event_total) cache, avg(event_total) avg, count(*)
count FROM MultimediaViewerNetworkPerformance_10774577 WHERE event_type
='image' AND event_response IS NOT NULL AND event_request IS NOT NULL AND
event_varnish1hits = 0 AND event_varnish2hits = 0 AND *event_varnish3hits
IS NULL* AND event_timestamp - event_lastModified > 60 AND event_total >
1000;
+---------+----------+--------+--------+----------+--------+-----------+-------+
| request | response | tcp | dns | redirect | cache | avg |
count |
+---------+----------+--------+--------+----------+--------+-----------+-------+
| 0.1204 | 0.7008 | 0.0791 | 0.0073 | 0.0000 | 0.0050 | 6396.3389 |
11902 |
+---------+----------+--------+--------+----------+--------+-----------+-------+
which is an error in our logging (or log aggregation) system - some
requests (apparently most of the slow ones) only pass through two tiers of
varnish cache and XCache will look like
cp1067 miss (0), cp1054 frontend miss (0)
and not
cp1053 miss (0), amssq43 miss (0), amssq48 frontend miss (0)
(This might be a good time to ask, what do three varnish entries even mean?
AFAIK we only have to tiers, front and back.)
The results look saner after including NULL for varnish3hits, although 10%
of the time is still unaccounted and must somehow fall into one of the
"gaps" of the ResourceTiming processing model
<http://www.w3.org/TR/resource-timing/#processing-model>, so we should
probably log more fields. But I think the assertion holds that image load
time is dominated by delays outside WMF's network.
(The 13% app cache time for requests with three varnish misses is still
weird. That means on average over 1s cache lookup time for requests which
weren't in the cache. Maybe AJAX preloading of images fails in a strange
way on some browser?)
Anyway I don't think that requestStart is the
right thing to look at to
measure what's happening behind the scenes, the time the local browser has
emitted the request isn't the time the server received it, nor is the time
the response starts being received proof of anything about the server side
of things. If varnish reported its own time breakdown through custom
headers, it would truly answer that question.
It's an upper estimate; I think it is fair to assume that the request will
be much faster than the response for an image which is large enough to take
5s to load. This assumes that varnish does not start
sending the file
before fully receiving it, which might indeed be false; I agree we
should
instrument the process better via headers.