On Mon, Jan 12, 2015 at 5:23 AM, Gilles Dubuc <gilles@wikimedia.org> wrote:
I don't think that your query was the right way to look at Varnish misses pulling an image from Swift, consider this one, which is my attempt to fix it:

SELECT sum(event_request)/sum(event_total) request, sum(event_response)/sum(event_total) response, sum(event_tcp)/sum(event_total) tcp, count(*) count FROM MultimediaViewerNetworkPerformance_10774577 WHERE event_type ='image' AND event_response IS NOT NULL AND event_request IS NOT NULL AND event_varnish1hits = 0 AND event_varnish2hits = 0 AND event_varnish3hits = 0 AND event_timestamp - event_lastModified > 60 AND event_total > 1000

request: 0.2031
response: 0.3213
tcp: 0.0132
count: 207

Very different results and 48% of the time is unaccounted for.

The difference seems to be due to the event_varnish3hits = 0 condition:

SELECT sum(event_request)/sum(event_total) request, sum(event_response)/sum(event_total) response, sum(event_tcp)/sum(event_total) tcp, sum(event_dns)/sum(event_total) dns, sum(event_redirect)/sum(event_total) redirect, sum(event_cache)/sum(event_total) cache, avg(event_total) avg, count(*) count FROM MultimediaViewerNetworkPerformance_10774577 WHERE event_type ='image' AND event_response IS NOT NULL AND event_request IS NOT NULL AND event_varnish1hits = 0 AND event_varnish2hits = 0 AND event_varnish3hits = 0  AND event_timestamp - event_lastModified > 60 AND event_total > 1000;
+---------+----------+--------+--------+----------+--------+------------+-------+
| request | response | tcp    | dns    | redirect | cache  | avg        | count |
+---------+----------+--------+--------+----------+--------+------------+-------+
|  0.2080 |   0.3245 | 0.0131 | 0.0032 |   0.0000 | 0.1307 | 11897.6429 |   224 |
+---------+----------+--------+--------+----------+--------+------------+-------+

SELECT sum(event_request)/sum(event_total) request, sum(event_response)/sum(event_total) response, sum(event_tcp)/sum(event_total) tcp, sum(event_dns)/sum(event_total) dns, sum(event_redirect)/sum(event_total) redirect, sum(event_cache)/sum(event_total) cache, avg(event_total) avg, count(*) count FROM MultimediaViewerNetworkPerformance_10774577 WHERE event_type ='image' AND event_response IS NOT NULL AND event_request IS NOT NULL AND event_varnish1hits = 0 AND event_varnish2hits = 0 AND event_varnish3hits IS NULL AND event_timestamp - event_lastModified > 60 AND event_total > 1000;
+---------+----------+--------+--------+----------+--------+-----------+-------+
| request | response | tcp    | dns    | redirect | cache  | avg       | count |
+---------+----------+--------+--------+----------+--------+-----------+-------+
|  0.1204 |   0.7008 | 0.0791 | 0.0073 |   0.0000 | 0.0050 | 6396.3389 | 11902 |
+---------+----------+--------+--------+----------+--------+-----------+-------+

which is an error in our logging (or log aggregation) system - some requests (apparently most of the slow ones) only pass through two tiers of varnish cache and XCache will look like 
cp1067 miss (0), cp1054 frontend miss (0)
and not
cp1053 miss (0), amssq43 miss (0), amssq48 frontend miss (0)

(This might be a good time to ask, what do three varnish entries even mean? AFAIK we only have to tiers, front and back.)

The results look saner after including NULL for varnish3hits, although 10% of the time is still unaccounted and must somehow fall into one of the "gaps" of the ResourceTiming processing model, so we should probably log more fields. But I think the assertion holds that image load time is dominated by delays outside WMF's network.

(The 13% app cache time for requests with three varnish misses is still weird. That means on average over 1s cache lookup time for requests which weren't in the cache. Maybe AJAX preloading of images fails in a strange way on some browser?)
 
Anyway I don't think that requestStart is the right thing to look at to measure what's happening behind the scenes, the time the local browser has emitted the request isn't the time the server received it, nor is the time the response starts being received proof of anything about the server side of things. If varnish reported its own time breakdown through custom headers, it would truly answer that question.

It's an upper estimate; I think it is fair to assume that the request will be much faster than the response for an image which is large enough to take >5s to load. This assumes that varnish does not start sending the file before fully receiving it, which might indeed be false; I agree we should instrument the process better via headers.