Looking for common patterns in those Varnish misses could offer more specific hints for optimisation.
Short of recording file type, we can look at content length, as reported by our logging. Looking at images that took > 1s to load (client time):
a) Geometric mean contentLength for all image requests: 131427 (sample size: 41729)
b) Geometric mean contentLength for Varnish misses going to Swift: 150328 (sample size: 13657)
c) Geometric mean contentLength for Varnish misses having the thumbnail generated on the spot: 260860 (sample size: 39)
d) Geometric mean contentLength for Varnish hits: 121982 (sample size: 27239)
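For reference, here's a minimal sketch of how figures like these can be computed, assuming hypothetical parsed log rows; the field names and toy values below are assumptions for illustration, not our actual schema:

    import math

    def geometric_mean(values):
        # exp(mean(log(x))): far less sensitive to heavy tails than the
        # arithmetic mean, which is why it's used for file sizes here.
        logs = [math.log(v) for v in values if v > 0]
        return math.exp(sum(logs) / len(logs))

    # Toy rows standing in for parsed log entries (hypothetical shape).
    events = [
        {"contentLength": 150000, "event_total": 3100, "cache": "miss"},
        {"contentLength": 120000, "event_total": 2400, "cache": "hit"},
        {"contentLength": 130000, "event_total": 1800, "cache": "hit"},
    ]
    slow = [e for e in events if e["event_total"] > 1000]  # client time > 1s
    misses = [e["contentLength"] for e in slow if e["cache"] == "miss"]
    hits = [e["contentLength"] for e in slow if e["cache"] == "hit"]
    print(geometric_mean(misses), len(misses))
    print(geometric_mean(hits), len(hits))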
And as a control, let's look at a data point that shouldn't have any correlation with contentLength, such as user agent:
e) Geometric mean contentLength for MSIE requests: 125130 (sample size: 1625)
f) Geometric mean contentLength for Chrome requests: 129669 (sample size: 34663)
g) Geometric mean contentLength for Firefox requests: 98304 (sample size: 477)
Figures c and g are clearly suffering from insufficient sample sizes, and figure e might be too.
At first glance, the data we're studying does seem imbalanced in terms of file size between misses and hits, by as much as 20%. This could be explained by the distribution of window sizes/requested thumbnail sizes. More on that below.
Going back to comparing misses and hits, with contentLength accounted for:
h) Geometric mean contentLength/event_total ratio for all image requests taking > 1s: 42 (sample size: 41798)
i) Geometric mean contentLength/event_total ratio for Varnish misses > 1s: 48 (sample size: 13731)
j) Geometric mean contentLength/event_total ratio for Varnish hits > 1s: 39 (sample size: 27285)
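A quick sketch of that ratio metric, reusing the geometric_mean() helper and `slow` list from the earlier snippet; the unit is presumably bytes per millisecond, assuming contentLength is in bytes and event_total in milliseconds:

    # Throughput-like ratio per request: contentLength / event_total.
    miss_r = [e["contentLength"] / e["event_total"]
              for e in slow if e["cache"] == "miss"]
    hit_r = [e["contentLength"] / e["event_total"]
             for e in slow if e["cache"] == "hit"]
    print(geometric_mean(miss_r), geometric_mean(hit_r))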
Things look more nuanced here than before, with Varnish misses being 13% slower than hits on the mean.
The fact that larger files have a higher chance of leaving the Varnish cache could explain the slower experience for some users. In particular, since window size is correlated with the file size downloaded by the user, people with high-resolution windows, if they are less numerous than people with lower-resolution windows, could be penalized by hitting Varnish misses more often than others. Tracking the requested thumbnail width in our performance logging would let us verify that:
https://phabricator.wikimedia.org/T86609

Looking at canvas width distribution grouped by our thumbnail size buckets in the MultimediaViewerDimensions table, there is an expected imbalance:
320 <= canvasWidth < 640: 19,991
640 <= canvasWidth < 800: 74,976
800 <= canvasWidth < 1024: 816,371
1024 <= canvasWidth < 1280: 2,034,364
1280 <= canvasWidth < 1920: 4,747,473
1920 <= canvasWidth < 2560: 77,837
2560 <= canvasWidth < 2880: 5,508
canvasWidth >= 2880: 1,189
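For illustration, here's a minimal sketch of how logged canvasWidth values can be grouped into those buckets; the boundaries come from the list above, everything else is hypothetical:

    import bisect
    from collections import Counter

    # Lower bounds of the thumbnail size buckets, in pixels.
    BOUNDS = [320, 640, 800, 1024, 1280, 1920, 2560, 2880]

    def bucket(canvas_width):
        # Returns the lower bound of the bucket containing canvas_width,
        # e.g. 1366 -> 1280; widths below 320 fall outside the buckets.
        i = bisect.bisect_right(BOUNDS, canvas_width) - 1
        return BOUNDS[i] if i >= 0 else None

    widths = [1366, 1920, 800, 1280, 2880]  # toy sample of canvasWidth values
    print(Counter(bucket(w) for w in widths))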
The sampling factor for that table is set to 1000 and these values span all the data, going back to the beginning of October. That means, for example, that users hitting the 2880 bucket (when available; not all images are wider than 2880px) are responsible for around 12,000 image views per day on average: 1,189 sampled events x 1000 is roughly 1.19 million views over roughly 100 days. That's not much for collectively visiting enough different images to make them stay in Varnish at that size.
There's a tough choice to make here for large window sizes. People with large screen resolutions probably hit Varnish misses quite often. Would these people prefer to wait the 10-31% extra time required by a miss (see later in this message for where that figure comes from) to see a better quality image? Or would they prefer getting a lower quality image faster?
As for small window sizes, the question becomes, for example, whether it's faster to get a 640px miss than an 800px hit, looking only at event_total and not contentLength. We'll be able to answer that question once we log thumbnail width in the performance log.
Now, in the above figures I was only looking at requests taking more than 1 second, whereas our dashboards look at all requests taking more than 20ms. 20ms was just an educated guess to rule out requests that only hit the local cache, but it's not perfect.
If we look at requests taking more than 1 second in the same fashion we currently do for the dashboards: