Hi everyone,
I recently looked very closely at client-gathered statistics about image serving performance from within Media Viewer, focusing specifically on the effect of thumbnail pre-rendering at upload time (which has been live for a few months) and thumbnail chaining (which was live for a few weeks and has now been turned off). The main question I was looking to answer is whether either of those techniques improved performance as experienced by users.
Chaining, when combined with pre-rendering, had no noticeable effect on performance experienced by viewers. This is logical, because with pre-rendering the thumbnail generation gains only happen at upload time, so clients requesting the image later won't be affected. As for the effect on image scaler load, it was so insignificant that it couldn't be measured. Chaining is probably still useful for people requesting non-standard thumbnail sizes, which I'm not measuring since I've only been looking at Media Viewer, but if that's the only use case chaining serves, the priority of addressing the community concerns over JPG sharpening in order to redeploy it seems much lower to me now.
The big discovery in my research is that we set out to do pre-rendering based on a wrong assumption. When looking at performance statistics earlier last year, we clearly saw that Varnish misses performed a lot worse than Varnish hits (well, duh), and so we set out to pre-render the thumbnail sizes Media Viewer needs in order to drastically reduce the number of Varnish misses. The reduction didn't happen.
The wrong assumption was that each Varnish miss is a case where the requested thumbnail has to be generated on the fly by the backend. The data I've just discovered shows that this is very rare for the thumbnail sizes Media Viewer currently uses. The vast majority of Varnish misses merely pull from Swift a thumbnail that has already been rendered at some earlier point in time and just happens to not have been requested for a while. And that Swift pull + Varnish re-add is what's making the majority of Varnish misses perform worse than hits, not the need to generate the thumbnail with ImageMagick. The bottom line is that the thumbnail prerendering provided insignificant performance gains for this set of sizes. Infrequently requested thumbnails are the main problem, not the fact that they are rendered on the fly the first time they are requested.
It seems like the only way to increase image serving performance in our current setup is to increase the expiry value in Varnish and/or increase Varnish capacity. Right now 17% of image requests in Media Viewer are Varnish misses, and 99.5% of those are pulling an existing thumbnail from Swift. Varnish misses are twice as slow as hits on average.
I plan to disable pre-rendering next week in order to confirm these findings and determine for certain what percentage of image requests pre-rendering is useful for on the set of sizes Media Viewer currently uses.
If you want to dig into the data, the relevant tables on the analytics DB are MultimediaViewerNetworkPerformance* and more specifically the event_varnish*, event_timestamp and event_lastModified columns.
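For illustration, here's a rough sketch of the kind of query that splits the misses into the two cases (an existing thumbnail pulled from Swift versus one rendered on the fly). The 60-second Last-Modified threshold and the exact miss condition are assumptions of mine rather than the definitive methodology; the table revision suffix is the one used in queries later in this thread:

SELECT
    -- a thumbnail whose Last-Modified is within ~60s of the request is counted
    -- as rendered on the fly; anything older counts as a plain Swift pull
    SUM(event_timestamp - event_lastModified > 60) / COUNT(*) AS swift_pull_ratio,
    SUM(event_timestamp - event_lastModified <= 60) / COUNT(*) AS fresh_render_ratio,
    COUNT(*) AS misses
FROM MultimediaViewerNetworkPerformance_10774577
WHERE event_type = 'image'
    AND event_lastModified IS NOT NULL
    AND event_varnish1hits = 0
    AND event_varnish2hits = 0
    AND event_varnish3hits = 0;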
Thanks, Gilles, this is great research!
The lack of effectiveness of the pre-rendering and the negative impact of Varnish misses seem like really important discoveries, and can inform our plans for improving Media Viewer performance over time.
If 17% of image requests in Media Viewer are indeed Varnish misses, it would make sense to investigate a practical solution for this issue. We still get user feedback that some images take a long time to load, and the ~8-second average load time for the 95th percentile (1) seems to confirm that performance can be sluggish for many of our users.
Do we know what it would take to increase the expiry value in Varnish and/or increase Varnish capacity? What kind of costs would we be looking at to improve performance a notch? Is this largely a matter of getting more machines to keep more images in Varnish?
As discussed, I would also be happy to reach out to our friends at Flickr to ask them how they address these issues, as a large dedicated photo sharing site -- and to see if our metrics match their own. It would be good to compare notes with them anyway, if they’re interested (we could even arrange a quick visit to their office, which is only a couple blocks from us, while you’re in town).
Thanks again for this invaluable work on an important issue — it is much appreciated!
Fabrice
(1) https://docs.google.com/a/wikimedia.org/presentation/d/19coD7h2guEpiuCjR17f0TogguPUvMghHFrA_u0C_yEk/edit#slide=id.g53b8901a1_8_45
Thank you for the analysis.
On 2015-01-09 10:46, Gilles Dubuc wrote:
The vast majority of Varnish misses merely pull from Swift a thumbnail that has already been rendered at some earlier point in time and just happens to not have been requested for a while. And that Swift pull + Varnish re-add is what's making the majority of Varnish misses perform worse than hits, not the need to generate the thumbnail with ImageMagick.
It seems like the only way to increase image serving performance in our current setup is to increase the expiry value in Varnish and/or increase Varnish capacity. Right now 17% of image requests in Media Viewer are Varnish misses, and 99.5% of those are pulling an existing thumbnail from Swift. Varnish misses are twice as slow as hits on average.
Is the 2x time on the client from last request byte to first response byte or to last byte of response or even something else?
How could the setup be improved? Would directly hitting Swift from the client be better?
On Fri, Jan 9, 2015 at 5:01 AM, Jan Zerebecki jan.wikimedia@zerebecki.de wrote:
Is the 2x time on the client from last request byte to first response byte or to last byte of response or even something else?
We are (mostly) using ResourceTiming, which measures the full fetch from start to end (and the individual steps too). See the handy illustration in the spec: http://www.w3.org/TR/resource-timing/#processing-model
Thanks for digging into this, Gilles! It's great that we are becoming more data-driven on this.
IMO the next step should be differentiating between Varnish delay and network delay. event_response should tell us the time between first and last response byte; so Swift delay will be in event_request and network delay in event_response (or so I think, but that should be verified).
A quick first attempt:
mysql:research@analytics-store.eqiad.wmnet [log]>
select sum(event_request)/sum(event_total) request,
       sum(event_response)/sum(event_total) response,
       count(*) count
from MultimediaViewerNetworkPerformance_10774577
where event_type = 'image'
  and event_response is not null
  and event_XCache not like '%hit%'
  and event_total > 5000
  and date(timestamp(timestamp)) = current_date()\G
*************************** 1. row ***************************
 request: 0.0800
response: 0.7390
   count: 536
suggests that most of the time is network delay between Varnish and the browser so it is more useful to think about CDNs than about caching strategies. (Also I wonder what the missing 20% is. Seems a bit high for DNS / TCP handshakes.)
I don't think that your query was the right way to look at Varnish misses pulling an image from Swift; consider this one, which is my attempt to fix it:
SELECT sum(event_request)/sum(event_total) request,
       sum(event_response)/sum(event_total) response,
       sum(event_tcp)/sum(event_total) tcp,
       count(*) count
FROM MultimediaViewerNetworkPerformance_10774577
WHERE event_type = 'image'
  AND event_response IS NOT NULL
  AND event_request IS NOT NULL
  AND event_varnish1hits = 0
  AND event_varnish2hits = 0
  AND event_varnish3hits = 0
  AND event_timestamp - event_lastModified > 60
  AND event_total > 1000
request: 0.2031 response: 0.3213 tcp: 0.0132 count: 207
Very different results, and 48% of the time is unaccounted for. Considering the Resource Timing model, that time could be spent in the app cache or DNS. It seems strange, but without figures it's hard to tell. Anyway, I don't think that requestStart is the right thing to look at to measure what's happening behind the scenes: the time the local browser emitted the request isn't the time the server received it, nor is the time the response starts being received proof of anything about the server side of things. If Varnish reported its own time breakdown through custom headers, that would truly answer the question.
What is the worst case negative impact of this test?
The benefit of prerendering was too small to be measured when it was deployed, and my findings hint at why: only 0.085% of images served in media viewer should be affected by prerendering (varnish misses that don't find a pre-existing image stored in Swift). I'm turning it off to confirm the fact that basically nothing should happen and media viewer performance shouldn't change. It's as risk-free as it gets because it's only seeking confirmation, but there could always be a surprise.
Maybe images uploaded before a certain time, or similar.
This information has been recorded as well, and it's what let me confirm that turning on pre-rendering had no measurable effect, because I was able to look at images uploaded before and after prerendering was active. It did lead me to find out about https://phabricator.wikimedia.org/T84980, but that seems to be caused by the way Varnish works. It could be part of the explanation, though. If performance can be as much as 36% worse for images freshly uploaded and added to Varnish, maybe the same phenomenon (which we can't measure with the current data, because we'd need to track these stats per-image) happens again when they're re-added to Varnish from Swift. And while at first glance it looks like a smaller effect (only 36% slower at worst, compared to 100% slower at worst with the "swift pull" phenomenon), if the same issue does happen on Varnish re-adds, then the average figures on non-fresh uploads might include suboptimal requests affected by the same issue, making the two figures separated by 36% closer than they really are if this doesn't affect only fresh uploads.
To describe the issue simply, the data suggests that Varnish needs warming up before reaching its optimal performance. The more an image is accessed, the faster it gets served by Varnish. And when images fall out of Varnish because they haven't been accessed in a while, the next time they are, that warmup might need to happen again, which could be one of the reasons why those "swift pulls" are so much slower. Things are probably made worse by the fact that, being infrequently accessed, those images are likely to fall out of Varnish again before the next time they're requested. And even if they don't completely fall out, their warmup period could be stretched over a much longer period of time than for frequently accessed images. Without per-image data, or Varnish reporting more things through custom headers, that explanation is theoretical and might not be everything that's at stake.
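As a rough way of keeping an eye on that fresh-upload effect over time, something along these lines could be compared; this is only a sketch, it assumes event_lastModified is a usable stand-in for upload/render time, and the 5-minute freshness window is arbitrary:

SELECT
    event_timestamp - event_lastModified < 300 AS fresh_upload,  -- "fresh" = within ~5 minutes of the request
    AVG(event_total) AS mean_total_ms,
    COUNT(*) AS sample_size
FROM MultimediaViewerNetworkPerformance_10774577
WHERE event_type = 'image'
    AND event_lastModified IS NOT NULL
    AND event_total > 20    -- same 20ms cutoff as the dashboards
GROUP BY fresh_upload;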
Federico, regarding geography and file types, the sample sizes I've been looking at for the main phenomena shouldn't be affected by that, as the size, type and geo mix over a large number of requests and a long period of time should be fairly consistent. Of course what matters is having large sample sizes.
FWIW here's what swift sees for time to first byte on HTTP GET in which it returned 200 (99th percentile and average)
Is this the latency between Varnish and Swift? Or does it also include web servers/workers pulling originals from Swift?
On Mon, Jan 12, 2015 at 1:23 PM, Gilles Dubuc gilles@wikimedia.org wrote:
FWIW here's what swift sees for time to first byte on HTTP GET in which it returned 200 (99th percentile and average)
Is this the latency between Varnish and Swift? Or does it also include web servers/workers pulling originals from Swift?
that's every swift request, so yeah it does include e.g. imagescalers
On Mon, Jan 12, 2015 at 5:23 AM, Gilles Dubuc gilles@wikimedia.org wrote:
I don't think that your query was the right way to look at Varnish misses pulling an image from Swift; consider this one, which is my attempt to fix it:
SELECT sum(event_request)/sum(event_total) request,
       sum(event_response)/sum(event_total) response,
       sum(event_tcp)/sum(event_total) tcp,
       count(*) count
FROM MultimediaViewerNetworkPerformance_10774577
WHERE event_type = 'image'
  AND event_response IS NOT NULL
  AND event_request IS NOT NULL
  AND event_varnish1hits = 0
  AND event_varnish2hits = 0
  AND event_varnish3hits = 0
  AND event_timestamp - event_lastModified > 60
  AND event_total > 1000
request: 0.2031 response: 0.3213 tcp: 0.0132 count: 207
Very different results and 48% of the time is unaccounted for.
The difference seems to be due to the event_varnish3hits = 0 condition:
SELECT sum(event_request)/sum(event_total) request,
       sum(event_response)/sum(event_total) response,
       sum(event_tcp)/sum(event_total) tcp,
       sum(event_dns)/sum(event_total) dns,
       sum(event_redirect)/sum(event_total) redirect,
       sum(event_cache)/sum(event_total) cache,
       avg(event_total) avg,
       count(*) count
FROM MultimediaViewerNetworkPerformance_10774577
WHERE event_type = 'image'
  AND event_response IS NOT NULL
  AND event_request IS NOT NULL
  AND event_varnish1hits = 0
  AND event_varnish2hits = 0
  AND *event_varnish3hits = 0*
  AND event_timestamp - event_lastModified > 60
  AND event_total > 1000;

+---------+----------+--------+--------+----------+--------+------------+-------+
| request | response | tcp    | dns    | redirect | cache  | avg        | count |
+---------+----------+--------+--------+----------+--------+------------+-------+
|  0.2080 |   0.3245 | 0.0131 | 0.0032 |   0.0000 | 0.1307 | 11897.6429 |   224 |
+---------+----------+--------+--------+----------+--------+------------+-------+
SELECT sum(event_request)/sum(event_total) request,
       sum(event_response)/sum(event_total) response,
       sum(event_tcp)/sum(event_total) tcp,
       sum(event_dns)/sum(event_total) dns,
       sum(event_redirect)/sum(event_total) redirect,
       sum(event_cache)/sum(event_total) cache,
       avg(event_total) avg,
       count(*) count
FROM MultimediaViewerNetworkPerformance_10774577
WHERE event_type = 'image'
  AND event_response IS NOT NULL
  AND event_request IS NOT NULL
  AND event_varnish1hits = 0
  AND event_varnish2hits = 0
  AND *event_varnish3hits IS NULL*
  AND event_timestamp - event_lastModified > 60
  AND event_total > 1000;

+---------+----------+--------+--------+----------+--------+-----------+-------+
| request | response | tcp    | dns    | redirect | cache  | avg       | count |
+---------+----------+--------+--------+----------+--------+-----------+-------+
|  0.1204 |   0.7008 | 0.0791 | 0.0073 |   0.0000 | 0.0050 | 6396.3389 | 11902 |
+---------+----------+--------+--------+----------+--------+-----------+-------+
which is an error in our logging (or log aggregation) system - some requests (apparently most of the slow ones) only pass through two tiers of varnish cache and XCache will look like cp1067 miss (0), cp1054 frontend miss (0) and not cp1053 miss (0), amssq43 miss (0), amssq48 frontend miss (0)
(This might be a good time to ask, what do three varnish entries even mean? AFAIK we only have two tiers, front and back.)
The results look saner after including NULL for varnish3hits, although 10% of the time is still unaccounted for and must somehow fall into one of the "gaps" of the ResourceTiming processing model http://www.w3.org/TR/resource-timing/#processing-model, so we should probably log more fields. But I think the assertion holds that image load time is dominated by delays outside WMF's network.
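To get a feel for how widespread the missing third entry is, a quick sketch (nothing authoritative, just counting how the column is populated for requests where the first two tiers missed):

SELECT
    event_varnish3hits IS NULL AS third_entry_missing,
    COUNT(*) AS requests
FROM MultimediaViewerNetworkPerformance_10774577
WHERE event_type = 'image'
    AND event_varnish1hits = 0
    AND event_varnish2hits = 0
GROUP BY third_entry_missing;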
(The 13% app cache time for requests with three varnish misses is still weird. That means on average over 1s cache lookup time for requests which weren't in the cache. Maybe AJAX preloading of images fails in a strange way on some browser?)
Anyway, I don't think that requestStart is the right thing to look at to measure what's happening behind the scenes: the time the local browser emitted the request isn't the time the server received it, nor is the time the response starts being received proof of anything about the server side of things. If Varnish reported its own time breakdown through custom headers, that would truly answer the question.
It's an upper estimate; I think it is fair to assume that the request will be much faster than the response for an image which is large enough to take over 5s to load. This assumes that varnish does not start sending the file before fully receiving it, which might indeed be false; I agree we should instrument the process better via headers.
which is an error in our logging (or log aggregation) system - some requests (apparently most of the slow ones) only pass through two tiers of varnish cache and XCache will look like cp1067 miss (0), cp1054 frontend miss (0) and not cp1053 miss (0), amssq43 miss (0), amssq48 frontend miss (0)
(This might be a good time to ask, what do three varnish entries even mean? AFAIK we only have two tiers, front and back.)
I believe it's whether you hit eqiad or not. I believe non-eqiad varnish hits eventually look things up in an eqiad varnish, so in essence there are three layers for some people.
--bawolff
cp1067 miss (0), cp1054 frontend miss (0) and not cp1053 miss (0), amssq43 miss (0), amssq48 frontend miss (0)
(This might be a good time to ask, what do three varnish entries even mean? AFAIK we only have two tiers, front and back.)
Two tiers per DC (caching or otherwise). However, it is a design decision that the varnishes in the caching DCs (like esams, where amssq43/48 live), instead of passing directly to the MediaWiki appservers in the main DCs, go through the varnishes in the main DCs (eqiad right now).
Gilles Dubuc, 12/01/2015 14:23:
Federico, regarding geography and file types, the sample sizes I've been looking at for the main phenomenons shouldn't be affected by that, as the size, type and geo mix over a large amount of requests and a long period of time should be fairly consistent. Of course what matters is having large sample sizes.
I'm not an expert in statistics, but this only matters if the two things you're observing are not correlated. As an extreme example, imagine all the varnish misses are huge GIFs used on rare articles, which for some reason get "out of varnish": then comparing them to a more balanced dataset of smaller images wouldn't give us useful information. There is certainly a reason those images are varnish misses; it's not a random thing, so IMHO you can't assume the samples are unbiased and representative. Or in other words, looking for common patterns in those varnish misses could offer more specific hints for optimisation.
Nemo
looking for common patterns in those varnish misses could offer more specific hints for optimisation
Short of recording file type, we can look at content length, as it is reported by our logging. Looking at images that took > 1s to load (client time):
a) Geometric mean contentLength for all image requests: 131427 (sample size: 41729)
b) Geometric mean contentLength for Varnish misses going to Swift: 150328 (sample size: 13657)
c) Geometric mean contentLength for Varnish misses having the thumbnail generated on the spot: 260860 (sample size: 39)
d) Geometric mean contentLength for Varnish hits: 121982 (sample size: 27239)
And as a control let's look at a data point which shouldn't have any correlation with contentLength, like user agent:
e) Geometric mean contentLength for MSIE requests: 125130 (sample size: 1625)
f) Geometric mean contentLength for Chrome requests: 129669 (sample size: 34663)
g) Geometric mean contentLength for Firefox requests: 98304 (sample size: 477)
Figures c and g are clearly suffering from insufficient sample sizes, and figure e might be too.
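For reference, figures like these boil down to queries of roughly this shape; the event_contentLength column name is an assumption on my part, and the miss condition follows the corrected one (third varnish entry either 0 or NULL). This variant would correspond to figure b:

SELECT
    EXP(AVG(LN(event_contentLength))) AS geometric_mean_content_length,
    COUNT(*) AS sample_size
FROM MultimediaViewerNetworkPerformance_10774577
WHERE event_type = 'image'
    AND event_total > 1000                          -- client load time > 1s
    AND event_contentLength > 0
    AND event_varnish1hits = 0                      -- Varnish miss...
    AND event_varnish2hits = 0
    AND (event_varnish3hits = 0 OR event_varnish3hits IS NULL)
    AND event_timestamp - event_lastModified > 60;  -- ...pulling an existing thumbnail from Swift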
At first glance it does seem like the data we're studying is imbalanced in terms of file size between misses and hits, by as much as 20%. This could be explained by the distribution of window sizes/requested thumbnail sizes. More on that below.
Going back to comparing misses and hits, with contentLength accounted for:
h) Geometric mean contentLength/event_total ratio for all image requests taking > 1s: 42 (sample size: 41798)
i) Geometric mean contentLength/event_total ratio for Varnish misses > 1s: 48 (sample size: 13731)
j) Geometric mean contentLength/event_total ratio for Varnish hits > 1s: 39 (sample size: 27285)
Things look more nuanced here, with varnish misses being 13% slower than hits on the mean.
The fact that larger files have a higher chance of leaving the varnish cache could explain the slower experience for some users. Particularly since window size is correlated to the file size downloaded by the user, this would suggest that people with high resolution windows, if they are less numerous than people with lower resolution windows, could be penalized by hitting varnish misses more often than others. Tracking the requested thumbnail width in our performance logging would let us verify that: https://phabricator.wikimedia.org/T86609
Looking at the canvas width distribution grouped by our thumbnail size buckets on the MultimediaViewerDimensions table, there is an expected imbalance:
320 <= canvasWidth < 640: 19,991
640 <= canvasWidth < 800: 74,976
800 <= canvasWidth < 1024: 816,371
1024 <= canvasWidth < 1280: 2,034,364
1280 <= canvasWidth < 1920: 4,747,473
1920 <= canvasWidth < 2560: 77,837
2560 <= canvasWidth < 2880: 5,508
canvasWidth >= 2880: 1,189
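A sketch of the kind of query behind these bucket counts; the event_canvasWidth column name is an assumption and the table revision suffix is omitted:

SELECT
    CASE
        WHEN event_canvasWidth < 640  THEN '320-639'
        WHEN event_canvasWidth < 800  THEN '640-799'
        WHEN event_canvasWidth < 1024 THEN '800-1023'
        WHEN event_canvasWidth < 1280 THEN '1024-1279'
        WHEN event_canvasWidth < 1920 THEN '1280-1919'
        WHEN event_canvasWidth < 2560 THEN '1920-2559'
        WHEN event_canvasWidth < 2880 THEN '2560-2879'
        ELSE '2880+'
    END AS bucket,
    COUNT(*) AS sampled_views
FROM MultimediaViewerDimensions   -- revision suffix omitted
WHERE event_canvasWidth >= 320
GROUP BY bucket;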
The sampling factor for that table is set to 1000 and these values are across the entire data, going all the way back to the beginning of October. This means that, for example, users hitting the 2880 bucket (when available; not all images have a width greater than 2880) are on average responsible for around 12,000 image views per day. That's not much for collectively visiting enough different images to make them stay in Varnish for that size.
There's a tough choice to make here for large window sizes. People with large screen resolutions probably hit varnish misses quite often. Would these people prefer to wait the 10-31% extra time (see later in this message where that figure comes from) required by a miss to see a better quality image? Or would they prefer getting a lower quality image faster?
As for small window sizes, the question becomes for example whether it's faster or not to get a 640px miss than a 800px hit, looking only at event_total and not contentLength. We'll be able to answer that question once we log thumbnail width in the performance log.
Now, in the above figures I was only looking at requests taking more than 1 second, whereas our dashboards look at all requests taking more than 20ms. 20ms was just an educated guess to rule out requests that only hit the local cache, but it's not perfect.
If we look at requests taking more than 1 second in the same fashion we currently do for the dashboards:
type datestring mean standard_deviation sample_size 1st_percentile 50th_percentile 90th_percentile 99th_percentile
miss 2015-01-12 2477.970860632991 2.25631950919532 8959 1014 2005 7193 12890 45738
hit 2015-01-12 2582.869954939406 2.3772863410541945 16631 1012 2023 8591 15191 50792
The same figures for requests between 20ms and 1s:
type datestring mean standard_deviation sample_size 1st_percentile 50th_percentile 90th_percentile 99th_percentile
miss 2015-01-12 421.22199393288224 1.9358048181852694 10081 35 469 840 919 980
hit 2015-01-12 283.74019151607365 2.3467070754790877 36720 28 326 782 882 975
And the same figures for requests > 20ms altogether:
type datestring mean standard_deviation sample_size 1st_percentile 50th_percentile 90th_percentile 99th_percentile
miss 2015-01-12 969.6876199131866 3.160975218030068 19045 48 907 3742 6815 28882
hit 2015-01-12 564.9235984534097 3.7979095214860923 53365 31 535 2966 5624 20967
Note that this differs from the live graphs because the queries need to be fixed a little, following Gergo's finding about the 3rd varnish column: https://phabricator.wikimedia.org/T86675
A possible explanation could be that for requests taking more than a second, the user's bandwidth problems outweigh our response time - regardless of it being a varnish hit or miss - by so much that our own performance becomes almost irrelevant in the total figure. I looked at more data taking contentLength into account to verify that assumption and it seems to hold true:
- for requests taking between 0.5s and 1s, varnish misses perform 31% worse than hits
- for requests taking between 1s and 2s, varnish misses perform 25% worse than hits
- for requests taking between 2s and 3s, varnish misses perform 20% worse than hits
- for requests taking between 3s and 4s, varnish misses perform 18% worse than hits
- for requests taking between 4s and 5s, varnish misses perform 11% worse than hits
- for requests taking longer than 5s, varnish misses perform 10% worse than hits
The long tail is clearly responsible for the 13% figure that came up above for requests taking longer than 1s. If one multiplies that worse-performance ratio by the midpoint of the duration range (e.g. 20% * 2.5s) as a very wild guesstimate of how much time is spent on the swift pull for the miss, the figure is between 200ms and 700ms. This doesn't seem too far-fetched, considering that we saw earlier that the mean swift latency is around 100ms.
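For what it's worth, one way to produce the per-bucket comparison is along these lines, shown here for the 2s-3s bucket (again, event_contentLength is an assumed column name, and dividing load time by content length is just one way of accounting for file size):

SELECT
    (event_varnish1hits = 0 AND event_varnish2hits = 0
        AND (event_varnish3hits = 0 OR event_varnish3hits IS NULL)) AS is_miss,
    EXP(AVG(LN(event_total / event_contentLength))) AS geo_mean_ms_per_byte,
    COUNT(*) AS sample_size
FROM MultimediaViewerNetworkPerformance_10774577
WHERE event_type = 'image'
    AND event_contentLength > 0
    AND event_total >= 2000 AND event_total < 3000   -- the 2s-3s bucket
GROUP BY is_miss;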
Since browser cache hits might still be messing with results between 20ms and 1s, we should find a more robust way to filter those out of the performance measurement: https://phabricator.wikimedia.org/T86672 If we can filter those out properly we should be able to confirm this hunch about miss performance's share of responsibility in the amount of time it takes to load an image.
Do we have any information on what latency is acceptable to the user (I know "acceptability" is not a binary state, and any latency is bad, but it would be good to have a target). Also it would be interesting to know how latency acceptability differs between media viewer and other uses of images such as click to image page.
For example, if users consider even the latency from a varnish hit unacceptable then we have a larger problem, or maybe the latency doesn't have to go all the way down to varnish levels for 90% of our users to be happy.
--bawolff
On Fri, Jan 9, 2015 at 3:28 PM, Brian Wolff bawolff@gmail.com wrote:
Also it would be interesting to know how latency acceptability differs between media viewer and other uses of images such as click to image page.
I would love to hear ideas on how to find that out. We have several proposals about improving loading speed for MediaViewer or tweaking things that might contribute to the subjective feeling of speed, but they are blocked because I have no idea how to make objective comparisons for subjective impressions of latency.
On Fri, Jan 9, 2015 at 1:46 AM, Gilles Dubuc gilles@wikimedia.org wrote:
I plan to disable pre-rendering next week in order to confirm these findings and determine for certain what percentage of image requests pre-rendering is useful for on the set of sizes Media Viewer currently uses.
What is the worst case negative impact of this test? I want to make sure that if we perform a test that could regress performance, the impact is either inherently low enough or that the test itself is conditioned to constrain it.
Thanks, Erik
Gilles Dubuc, 09/01/2015 10:46:
Right now 17% of image requests in Media Viewer are Varnish misses, and 99.5% of those are pulling an existing thumbnail from Swift. Varnish misses are twice as slow as hits on average.
Very interesting.
I plan to disable pre-rendering next week in order to confirm these findings and determine for certain what percentage of image requests pre-rendering is useful for on the set of sizes Media Viewer currently uses.
Isn't there another way to differentiate the images affected by pre-rendering? Maybe images uploaded before a certain time, or similar.
Gergo Tisza, 09/01/2015 20:54:
IMO the next step should be differentiating between Varnish delay and network delay.
Or ensuring that we're not comparing apples and oranges. Considering thumbnails of similar size and format would be an improvement. Some formats, like GIF and very detailed PNGs, are both rarer and significantly heavier (possibly also slower to parse).
suggests that most of the time is network delay between Varnish and the browser so it is more useful to think about CDNs than about caching strategies.
Hm. If this were true, would it mean that varnish misses correlate to geography?
(Also I wonder what the missing 20% is. Seems a bit high for DNS / TCP handshakes.)
Those are a substantial portion of load time on many of our pages which load resources from multiple domains (typically, current + bits + upload + login + meta for CentralNotice + another wiki for CSS/JS). The connection view of http://www.webpagetest.org/result/140605_N7_CSX/1/details/ seems a typical example.
Nemo
On Fri, Jan 9, 2015 at 9:46 AM, Gilles Dubuc gilles@wikimedia.org wrote:
The bottom line is that the thumbnail prerendering provided insignificant performance gains for this set of sizes. Infrequently requested thumbnails are the main problem, not the fact that they are rendered on the fly the first time they are requested.
It seems like the only way to increase image serving performance in our current setup is to increase the expiry value in Varnish and/or increase Varnish capacity. Right now 17% of image requests in Media Viewer are Varnish misses, and 99.5% of those are pulling an existing thumbnail from Swift. Varnish misses are twice as slow as hits on average.
FWIW here's what swift sees for time to first byte on HTTP GET in which it returned 200 (99th percentile and average)
https://graphite.wikimedia.org/render/?width=672&height=381&_salt=14...)