"Looking for common patterns in those varnish misses could offer more specific hints for optimisation"

Short of recording file type, we can look at content length, as it is reported by our logging. Looking at images that took > 1s to load (client time):

a) Geometric mean contentLength for all image requests: 131427 (sample size: 41729)
b) Geometric mean contentLength for Varnish misses going to Swift: 150328 (sample size: 13657)
c) Geometric mean contentLength for Varnish misses having the thumbnail generated on the spot: 260860 (sample size: 39)
d) Geometric mean contentLength for Varnish hits: 121982 (sample size: 27239)

And as a control, let's look at a data point that shouldn't have any correlation with contentLength, such as user agent:

e) Geometric mean contentLength for MSIE requests: 125130 (sample size: 1625)
f) Geometric mean contentLength for Chrome requests: 129669 (sample size: 34663)
g) Geometric mean contentLength for Firefox requests: 98304 (sample size: 477)

Figures c and g are clearly suffering from insufficient sample sizes, and figure e might be too.

At first glance, the data we're studying does appear to be imbalanced in terms of file size between misses and hits, by as much as 20%. This could be explained by the distribution of window sizes/requested thumbnail sizes. More on that below.
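For reference, the geometric mean here is just exp(mean(log(x))). A minimal Python sketch of the aggregation (the varnish_status field and the way events get loaded are illustrative only, not our actual EventLogging schema):

import math

def geometric_mean(values):
    # geometric mean = exp(mean(log(x))), only defined for positive values
    logs = [math.log(v) for v in values if v > 0]
    return math.exp(sum(logs) / len(logs)) if logs else float("nan")

# events: dicts with contentLength, event_total (ms) and a hit/miss flag,
# already filtered down to image requests with event_total > 1000
def report(events):
    for label, subset in (
        ("all image requests", events),
        ("Varnish misses", [e for e in events if e["varnish_status"] == "miss"]),
        ("Varnish hits", [e for e in events if e["varnish_status"] == "hit"]),
    ):
        sizes = [e["contentLength"] for e in subset]
        print(label, round(geometric_mean(sizes)), "(sample size: %d)" % len(subset))

The contentLength/event_total figures below are the same computation applied to the per-event ratio instead of the raw contentLength.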

Going back to comparing misses and hits, with contentLength accounted for:

h) Geometric mean contentLength/event_total ratio for all image requests taking > 1s : 42 (sample size: 41798)
i) Geometric mean contentLength/event_total ratio for Varnish misses > 1s: 48 (sample size: 13731)
j) Geometric mean contentLength/event_total ratio for Varnish hits > 1s: 39 (sample size: 27285)

Things look more nuanced here than before, with Varnish misses being 13% slower than hits on the mean.

The fact that larger files have a higher chance of being evicted from the Varnish cache could explain the slower experience for some users. In particular, since window size correlates with the file size downloaded by the user, people with high-resolution windows, if they are less numerous than people with lower-resolution windows, could be penalized by hitting Varnish misses more often than others. Tracking the requested thumbnail width in our performance logging would let us verify that: https://phabricator.wikimedia.org/T86609

Looking at the canvas width distribution grouped by our thumbnail size buckets in the MultimediaViewerDimensions table, there is an expected imbalance:

320 <= canvasWidth < 640: 19,991
640 <= canvasWidth < 800: 74,976
800 <= canvasWidth < 1024: 816,371
1024 <= canvasWidth < 1280: 2,034,364
1280 <= canvasWidth < 1920: 4,747,473
1920 <= canvasWidth < 2560: 77,837
2560 <= canvasWidth < 2880: 5,508
canvasWidth >= 2880: 1,189

The sampling factor for that table is set to 1000 and these values span the entire dataset, going all the way back to the beginning of October. This means, for example, that on average users hitting the 2880 bucket (when available; not all images are wider than 2880px) are responsible for around 12,000 image views per day. That's not much for collectively visiting enough different images to keep them in Varnish at that size.
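To spell out the arithmetic behind that figure (the ~100 days is my rough count from the beginning of October to now):

# >= 2880 bucket: 1189 sampled events, sampling factor 1000,
# roughly 100 days of data since the beginning of October (rough count)
daily_views = 1189 * 1000 / 100
print(daily_views)  # ~11,890 image views per day, i.e. around 12,000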

There's a tough choice to make here for large window sizes. People with large screen resolutions probably hit Varnish misses quite often. Would these people prefer to wait the 10-31% extra time (see later in this message for where that figure comes from) required by a miss to see a better quality image? Or would they prefer getting a lower quality image faster?

As for small window sizes, the question becomes, for example, whether it's faster to get a 640px miss than an 800px hit, looking only at event_total and not contentLength. We'll be able to answer that question once we log thumbnail width in the performance log.

Now, in the above figures I was only looking at requests taking more than 1 second, whereas our dashboards look at all requests taking more than 20ms. The 20ms threshold was just an educated guess to rule out requests served from the browser's local cache, but it's not perfect.

If we look at requests taking more than 1 second in the same fashion we currently do for the dashboards:

type datestring mean standard_deviation sample_size 1st_percentile 50th_percentile 90th_percentile 99th_percentile
miss 2015-01-12    2477.970860632991    2.25631950919532    8959    1014    2005    7193    12890    45738
hit 2015-01-12    2582.869954939406    2.3772863410541945    16631    1012    2023    8591    15191    50792

The same figures for requests between 20ms and 1s:
type datestring mean standard_deviation sample_size 1st_percentile 50th_percentile 90th_percentile 99th_percentile
miss 2015-01-12    421.22199393288224    1.9358048181852694    10081    35    469    840    919    980
hit 2015-01-12    283.74019151607365    2.3467070754790877    36720    28    326    782    882    975

And the same figures for requests > 20ms altogether:
type datestring mean standard_deviation sample_size 1st_percentile 50th_percentile 90th_percentile 99th_percentile
miss 2015-01-12    969.6876199131866    3.160975218030068    19045    48    907    3742    6815    28882
hit 2015-01-12    564.9235984534097    3.7979095214860923    53365    31    535    2966    5624    20967

Note that this differs from the live graphs because the queries need to be fixed a little, following Gergo's finding about the 3rd varnish column: https://phabricator.wikimedia.org/T86675
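For what it's worth, the mean and standard_deviation columns in these tables look geometric (a standard deviation of ~2-3 against durations in the thousands of milliseconds only makes sense as a multiplicative factor), so the aggregation is roughly the following. This is an illustrative Python sketch, not the actual dashboard query:

import math

def summarize(durations_ms):
    # geometric mean/standard deviation and percentiles of event_total values
    logs = [math.log(d) for d in durations_ms]
    n = len(logs)
    mean_log = sum(logs) / n
    var_log = sum((x - mean_log) ** 2 for x in logs) / n
    ranked = sorted(durations_ms)
    def percentile(p):
        return ranked[min(n - 1, int(p * n / 100))]
    return {
        "mean": math.exp(mean_log),                          # geometric mean
        "standard_deviation": math.exp(math.sqrt(var_log)),  # geometric std dev
        "sample_size": n,
        "percentiles": {p: percentile(p) for p in (1, 50, 90, 99)},
    }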

A possible explanation could be that for requests taking more than a second, the user's bandwidth problems outweigh our response time - regardless of whether it's a Varnish hit or miss - by so much that our own performance becomes almost irrelevant in the total figure. I looked at more data, taking contentLength into account, to verify that assumption, and it seems to hold true:

- for requests taking between 0.5s and 1s, varnish misses perform 31% worse than hits
- for requests taking between 1s and 2s, varnish misses perform 25% worse than hits
- for requests taking between 2s and 3s, varnish misses perform 20% worse than hits
- for requests taking between 3s and 4s, varnish misses perform 18% worse than hits
- for requests taking between 4s and 5s, varnish misses perform 11% worse than hits
- for requests taking longer than 5s, varnish misses perform 10% worse than hits

The long tail is clearly responsible for the 13% figure that came up above for requests taking longer than 1s. If one multiplies that worse-performance ratio by the midpoint of the duration range (e.g. 20% * 2.5s), as a very wild guesstimate of how much time is spent on the Swift pull for the miss, the figure falls between 200ms and 700ms. This doesn't seem too far-fetched, considering that we saw earlier that the mean Swift latency is around 100ms.
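Spelling that guesstimate out (the midpoints are just my reading of the duration buckets above; the open-ended > 5s bucket is left out):

# worse-performance ratio per duration bucket, multiplied by the bucket midpoint (in seconds)
buckets = [
    (0.75, 0.31),  # 0.5s - 1s
    (1.5, 0.25),   # 1s - 2s
    (2.5, 0.20),   # 2s - 3s
    (3.5, 0.18),   # 3s - 4s
    (4.5, 0.11),   # 4s - 5s
]
for midpoint_s, ratio in buckets:
    print(f"{midpoint_s}s bucket: ~{int(midpoint_s * ratio * 1000)}ms on the Swift pull")
# the products fall roughly between 200ms and 700ms, which is where the figure above comes from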

Since browser cache hits might still be messing with the results between 20ms and 1s, we should find a more robust way to filter those out of the performance measurement (https://phabricator.wikimedia.org/T86672). If we can filter those out properly, we should be able to confirm this hunch about miss performance's share of responsibility in the total time it takes to load an image.

On Tue, Jan 13, 2015 at 8:52 AM, Federico Leva (Nemo) <nemowiki@gmail.com> wrote:
Gilles Dubuc, 12/01/2015 14:23:
Federico, regarding geography and file types, the sample sizes I've been
looking at for the main phenomenons shouldn't be affected by that, as
the size, type and geo mix over a large amount of requests and a long
period of time should be fairly consistent. Of course what matters is
having large sample sizes.

I'm not an expert in statistics, but this only matters if the two things you're observing are not correlated. As an extreme example, imagine all the varnish misses are huge GIFs used on rare articles, which for some reason get "out of varnish": then comparing them to a more balanced dataset of smaller images wouldn't give us useful information. There is certainly a reason those images are varnish misses, it's not a random thing, so IMHO you can't assume the samples are unbiased and representative. Or in other words, looking for common patterns in those varnish misses could offer more specific hints for optimisation.


Nemo

_______________________________________________
Multimedia mailing list
Multimedia@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/multimedia