Looking for common patterns in those Varnish misses could offer more
specific hints for optimisation.
Short of recording file type, we can look at content length, as it is
reported by our logging. Looking at images that took > 1s to load (client
time):
a) Geometric mean contentLength for all image requests: 131427 (sample
size: 41729)
b) Geometric mean contentLength for Varnish misses going to Swift: 150328
(sample size: 13657)
c) Geometric mean contentLength for Varnish misses having the thumbnail
generated on the spot: 260860 (sample size: 39)
d) Geometric mean contentLength for Varnish hits: 121982 (sample size:
27239)
And as a control, let's look at a dimension which shouldn't have any
correlation with contentLength, like user agent:
e) Geometric mean contentLength for MSIE requests: 125130 (sample size:
1625)
f) Geometric mean contentLength for Chrome requests: 129669 (sample size:
34663)
g) Geometric mean contentLength for Firefox requests: 98304 (sample size:
477)
Figures c and g clearly suffer from insufficient sample sizes, and figure e
might too.
At first glance, the data we're studying does seem imbalanced in terms of
file size between misses and hits, by as much as 20%. This could be
explained by the distribution of window sizes/requested thumbnail sizes.
More on that below.
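As an aside on method, here is a minimal sketch of how such a geometric mean could be computed; the sample values below are made up for illustration, not taken from the real logs:

```python
import math

def geometric_mean(values):
    """Geometric mean via the mean of logs, which is numerically safer
    than multiplying thousands of byte counts together."""
    assert all(v > 0 for v in values)
    return math.exp(sum(math.log(v) for v in values) / len(values))

# Hypothetical contentLength samples (bytes) for a handful of requests.
lengths = [120000, 95000, 310000, 140000]
print(round(geometric_mean(lengths)))
```

The geometric mean is a reasonable choice here because contentLength is heavily right-skewed; a few very large files would dominate an arithmetic mean.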
Going back to comparing misses and hits, with contentLength accounted for:
h) Geometric mean contentLength/event_total ratio for all image requests
taking > 1s : 42 (sample size: 41798)
i) Geometric mean contentLength/event_total ratio for Varnish misses > 1s:
48 (sample size: 13731)
j) Geometric mean contentLength/event_total ratio for Varnish hits > 1s: 39
(sample size: 27285)
Things look more nuanced here than before, with Varnish misses being 13%
slower than hits on the mean.
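For concreteness, a sketch of how that per-request ratio could be computed, with made-up numbers (contentLength in bytes, event_total in milliseconds); this illustrates the method, not the real query:

```python
import math

def geo_mean(values):
    # Geometric mean computed on the log scale.
    return math.exp(sum(math.log(v) for v in values) / len(values))

# Hypothetical (contentLength, event_total) pairs for requests > 1s.
requests = [(150000, 3100), (90000, 2400), (260000, 5200)]

# Per-request ratio of bytes to total load time, then the geometric
# mean of those ratios (rather than the ratio of the two means).
ratios = [length / total for length, total in requests]
print(round(geo_mean(ratios), 1))
```

Taking the geometric mean of per-request ratios, rather than dividing the two aggregate means, keeps each request weighted equally regardless of its size.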
The fact that larger files have a higher chance of leaving the Varnish
cache could explain the slower experience for some users. In particular,
since window size is correlated with the file size downloaded by the user,
people with high-resolution windows, if they are less numerous than people
with lower-resolution windows, could be penalized by hitting Varnish misses
more often than others. Tracking the requested thumbnail width in our
performance logging would let us verify that:
https://phabricator.wikimedia.org/T86609
Looking at canvas width distribution grouped by our thumbnail size buckets
on the MultimediaViewerDimensions table, there is an expected imbalance:
320 <= canvasWidth < 640: 19,991
640 <= canvasWidth < 800: 74,976
800 <= canvasWidth < 1024: 816,371
1024 <= canvasWidth < 1280: 2,034,364
1280 <= canvasWidth < 1920: 4,747,473
1920 <= canvasWidth < 2560: 77,837
2560 <= canvasWidth < 2880: 5,508
canvasWidth >= 2880: 1,189
The sampling factor for that table is set to 1000 and these values span the
entire dataset, going all the way back to the beginning of October. This
means, for example, that on average users hitting the 2880 bucket (when
available; not all images are wider than 2880px) are responsible for around
12,000 image views per day. That's not much for collectively visiting
enough different images to keep that size in Varnish.
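Spelled out, the back-of-the-envelope arithmetic behind that daily figure, assuming roughly 100 days between the beginning of October and these mid-January numbers:

```python
SAMPLING_FACTOR = 1000   # one event logged per 1000 image views
events_in_bucket = 1189  # canvasWidth >= 2880, entire period
days = 100               # ~Oct 1 to mid-January (assumed)

# Scale the sampled event count back up, then spread it over the period.
views_per_day = events_in_bucket * SAMPLING_FACTOR / days
print(round(views_per_day))  # → 11890, i.e. around 12,000 views/day
```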
There's a tough choice to make here for large window sizes. People with
large screen resolutions probably hit varnish misses quite often. Would
these people prefer to wait the 10-31% extra time (see later in this
message where that figure comes from) required by a miss to see a better
quality image? Or would they prefer getting a lower quality image faster?
As for small window sizes, the question becomes for example whether it's
faster or not to get a 640px miss than a 800px hit, looking only at
event_total and not contentLength. We'll be able to answer that question
once we log thumbnail width in the performance log.
Now, in the above figures I was only looking at requests taking more than 1
second, whereas our dashboards look at all requests taking more than 20ms.
20ms was just an educated guess to rule out requests that only hit the
local cache, but it's not perfect.
If we look at requests taking more than 1 second in the same fashion we
currently do for the dashboards:
type datestring mean standard_deviation sample_size 1st_percentile 50th_percentile 90th_percentile 99th_percentile
miss 2015-01-12 2477.97 2.26  8959 1014 2005 7193 12890 45738
hit  2015-01-12 2582.87 2.38 16631 1012 2023 8591 15191 50792
The same figures for requests between 20ms and 1s:
type datestring mean standard_deviation sample_size 1st_percentile 50th_percentile 90th_percentile 99th_percentile
miss 2015-01-12 421.22 1.94 10081 35 469 840 919 980
hit  2015-01-12 283.74 2.35 36720 28 326 782 882 975
And the same figures for requests > 20ms altogether:
type datestring mean standard_deviation sample_size 1st_percentile 50th_percentile 90th_percentile 99th_percentile
miss 2015-01-12 969.69 3.16 19045 48 907 3742 6815 28882
hit  2015-01-12 564.92 3.80 53365 31 535 2966 5624 20967
Note that this differs from the live graphs because the queries need to be
fixed a little, following Gergo's finding about the 3rd varnish column:
https://phabricator.wikimedia.org/T86675
A possible explanation could be that for requests taking more than a
second, the user's bandwidth problems outweigh our response time,
regardless of whether the request is a Varnish hit or miss, by so much that
our own performance becomes almost irrelevant in the total figure. I looked
at more data taking contentLength into account to verify that assumption,
and it seems to hold true:
- for requests taking between 0.5s and 1s, varnish misses perform 31% worse
than hits
- for requests taking between 1s and 2s, varnish misses perform 25% worse
than hits
- for requests taking between 2s and 3s, varnish misses perform 20% worse
than hits
- for requests taking between 3s and 4s, varnish misses perform 18% worse
than hits
- for requests taking between 4s and 5s, varnish misses perform 11% worse
than hits
- for requests taking longer than 5s, varnish misses perform 10% worse than
hits
The long tail is clearly responsible for the 13% figure that came up above
for requests taking longer than 1s. If one multiplies that performance gap
by the midpoint of the duration range (e.g. 20% × 2.5s) as a very wild
guesstimate of how much time is spent on the Swift pull for the miss, the
figure is between 200ms and 700ms. This doesn't seem too far-fetched,
considering that we saw earlier that the mean Swift latency is around
100ms.
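That guesstimate, spelled out with the bucket figures above (the 5s+ bucket is skipped since it has no midpoint):

```python
# (bucket range in seconds, miss-vs-hit performance gap from the list above)
buckets = [
    ((0.5, 1), 0.31),
    ((1, 2), 0.25),
    ((2, 3), 0.20),
    ((3, 4), 0.18),
    ((4, 5), 0.11),
]

# Wild guesstimate: extra time spent on the miss ≈ gap × bucket midpoint.
for (lo, hi), gap in buckets:
    midpoint_ms = (lo + hi) / 2 * 1000
    print(f"{lo}-{hi}s bucket: ~{gap * midpoint_ms:.0f} ms extra on a miss")
```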
Since browser cache hits might still be skewing results between 20ms and
1s, we should find a more robust way to filter those out of the performance
measurement:
https://phabricator.wikimedia.org/T86672
If we can filter those out properly, we should be able to confirm this
hunch about how much of image load time is attributable to miss
performance.
On Tue, Jan 13, 2015 at 8:52 AM, Federico Leva (Nemo) <nemowiki(a)gmail.com>
wrote:
Gilles Dubuc, 12/01/2015 14:23:
Federico, regarding geography and file types, the sample sizes I've been
looking at for the main phenomena shouldn't be affected by that, as the
size, type and geo mix over a large amount of requests and a long period
of time should be fairly consistent. Of course what matters is having
large sample sizes.
I'm not an expert in statistics, but this only matters if the two things
you're observing are not correlated. As an extreme example, imagine all the
varnish misses are huge GIFs used on rare articles, which for some reason
get "out of varnish": then comparing them to a more balanced dataset of
smaller images wouldn't give us useful information. There is certainly a
reason those images are varnish misses, it's not a random thing, so IMHO
you can't assume the samples are unbiased and representative. Or in other
words, looking for common patterns in those varnish misses could offer more
specific hints for optimisation.
Nemo
_______________________________________________
Multimedia mailing list
Multimedia(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/multimedia