Looking for common patterns in those Varnish misses could offer more specific hints for optimisation.
Short of recording file type, we can look at content length, as reported by our logging. Looking at images that took > 1s to load (client time):
a) Geometric mean contentLength for all image requests: 131427 (sample size: 41729)
b) Geometric mean contentLength for Varnish misses going to Swift: 150328 (sample size: 13657)
c) Geometric mean contentLength for Varnish misses having the thumbnail generated on the spot: 260860 (sample size: 39)
d) Geometric mean contentLength for Varnish hits: 121982 (sample size: 27239)
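For reference, here's a minimal sketch of how figures like these can be computed, assuming hypothetical parsed log rows; the field names and toy values below are assumptions for illustration, not our actual schema:

    import math

    def geometric_mean(values):
        # exp(mean(log(x))): far less sensitive to heavy tails than the
        # arithmetic mean, which is why it's used for file sizes here.
        logs = [math.log(v) for v in values if v > 0]
        return math.exp(sum(logs) / len(logs))

    # Toy rows standing in for parsed log entries (hypothetical shape).
    events = [
        {"contentLength": 150000, "event_total": 3100, "cache": "miss"},
        {"contentLength": 120000, "event_total": 2400, "cache": "hit"},
        {"contentLength": 130000, "event_total": 1800, "cache": "hit"},
    ]
    slow = [e for e in events if e["event_total"] > 1000]  # client time > 1s
    misses = [e["contentLength"] for e in slow if e["cache"] == "miss"]
    hits = [e["contentLength"] for e in slow if e["cache"] == "hit"]
    print(geometric_mean(misses), len(misses))
    print(geometric_mean(hits), len(hits))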
And as a control, let's look at a data point that shouldn't have any correlation with contentLength, such as user agent:
e) Geometric mean contentLength for MSIE requests: 125130 (sample size: 1625)
f) Geometric mean contentLength for Chrome requests: 129669 (sample size: 34663)
g) Geometric mean contentLength for Firefox requests: 98304 (sample size: 477)
Figures c and g are clearly suffering from insufficient sample sizes, and figure e might be too.
At first glance, the data we're studying does seem imbalanced in terms of file size between misses and hits, by as much as 20%. This could be explained by the distribution of window sizes/requested thumbnail sizes. More on that below.
Going back to comparing misses and hits, with contentLength accounted for:
h) Geometric mean contentLength/event_total ratio for all image requests taking > 1s: 42 (sample size: 41798)
i) Geometric mean contentLength/event_total ratio for Varnish misses > 1s: 48 (sample size: 13731)
j) Geometric mean contentLength/event_total ratio for Varnish hits > 1s: 39 (sample size: 27285)
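A quick sketch of that ratio metric, reusing the geometric_mean() helper and `slow` list from the earlier snippet; the unit is presumably bytes per millisecond, assuming contentLength is in bytes and event_total in milliseconds:

    # Throughput-like ratio per request: contentLength / event_total.
    miss_r = [e["contentLength"] / e["event_total"]
              for e in slow if e["cache"] == "miss"]
    hit_r = [e["contentLength"] / e["event_total"]
             for e in slow if e["cache"] == "hit"]
    print(geometric_mean(miss_r), geometric_mean(hit_r))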
Things look more nuanced here than before, with Varnish misses being 13% slower than hits on the mean.
The fact that larger files have a higher chance of leaving the Varnish cache could explain the slower experience for some users. In particular, since window size is correlated with the file size downloaded by the user, people with high-resolution windows, if they are less numerous than people with lower-resolution windows, could be penalized by hitting Varnish misses more often than others. Tracking the requested thumbnail width in our performance logging would let us verify that:
https://phabricator.wikimedia.org/T86609

Looking at canvas width distribution grouped by our thumbnail size buckets in the MultimediaViewerDimensions table, there is an expected imbalance:
320 <= canvasWidth < 640: 19,991
640 <= canvasWidth < 800: 74,976
800 <= canvasWidth < 1024: 816,371
1024 <= canvasWidth < 1280: 2,034,364
1280 <= canvasWidth < 1920: 4,747,473
1920 <= canvasWidth < 2560: 77,837
2560 <= canvasWidth < 2880: 5,508
canvasWidth >= 2880: 1,189
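For illustration, here's a minimal sketch of how logged canvasWidth values can be grouped into those buckets; the boundaries come from the list above, everything else is hypothetical:

    import bisect
    from collections import Counter

    # Lower bounds of the thumbnail size buckets, in pixels.
    BOUNDS = [320, 640, 800, 1024, 1280, 1920, 2560, 2880]

    def bucket(canvas_width):
        # Returns the lower bound of the bucket containing canvas_width,
        # e.g. 1366 -> 1280; widths below 320 fall outside the buckets.
        i = bisect.bisect_right(BOUNDS, canvas_width) - 1
        return BOUNDS[i] if i >= 0 else None

    widths = [1366, 1920, 800, 1280, 2880]  # toy sample of canvasWidth values
    print(Counter(bucket(w) for w in widths))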
The sampling factor for that table is set to 1000 and these values span all the data, going back to the beginning of October. That means, for example, that users hitting the 2880 bucket (when available; not all images are wider than 2880px) are responsible for around 12,000 image views per day on average: 1,189 sampled events x 1000 is roughly 1.19 million views over roughly 100 days. That's not much for collectively visiting enough different images to make them stay in Varnish at that size.
There's a tough choice to make here for large window sizes. People with large screen resolutions probably hit Varnish misses quite often. Would these people prefer to wait the 10-31% extra time required by a miss (see later in this message for where that figure comes from) to see a better quality image? Or would they prefer getting a lower quality image faster?
As for small window sizes, the question becomes, for example, whether it's faster to get a 640px miss than an 800px hit, looking only at event_total and not contentLength. We'll be able to answer that question once we log thumbnail width in the performance log.
Now, in the above figures I was only looking at requests taking more than 1 second, whereas our dashboards look at all requests taking more than 20ms. 20ms was just an educated guess to rule out requests that only hit the local cache, but it's not perfect.
If we look at requests taking more than 1 second in the same fashion we currently do for the dashboards: