Just to get a sense of the scale, "non-standard" sizes > 1280 represent approximately 2 TB of Swift storage at the moment. And all sizes <= 1280 (where we can't tell "non-standard/standard" apart) represent approximately 16 TB. As for "standard sizes" > 1280, they total around 1.6 TB.
It's hard to estimate how much we're looking to save on sizes < 1280 due to the issue I've described earlier. But it's probably something expressed in terabytes.
Filippo told me that the figures I've just mentioned don't take swift replication into account (currently 3 copies), which means that we're actually talking about three times as much physical storage space.
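To put rough numbers on that, here is a quick back-of-the-envelope sketch in Python that applies the 3x replication factor to the approximate figures above. The variable and function names are made up for the example, and the inputs are only the rounded figures quoted in this email, so treat the output as an order of magnitude.

```python
# Approximate logical sizes in TB, as quoted above (rounded figures).
logical_tb = {
    "non-standard, > 1280": 2.0,
    "all sizes, <= 1280": 16.0,
    "standard, > 1280": 1.6,
}

REPLICAS = 3  # current swift replication factor mentioned by Filippo


def physical_tb(logical, replicas=REPLICAS):
    """Return physical storage per category in TB, rounded to one decimal."""
    return {name: round(size * replicas, 1) for name, size in logical.items()}


physical = physical_tb(logical_tb)
for name, size in physical.items():
    print(f"{name}: ~{size} TB physical")
```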
I've looked at the amount of hits for sizes > 1280 and "non-standard" thumbnails are viewed 3.3 times less than "standard ones". That means some strange sizes are getting a decent amount of traffic, but I haven't looked at the distribution yet to see if there are some sizes that clearly stand out and might be "standard" sizes which we don't know about lurking in there.
I've attached Filippo's CSV dumps, so that everyone can have fun at home extracting meaning from that data.
For reference, this is the list of "standard" sizes we've come up with, by hunting for various areas of the code that govern thumbnail sizes served:
On Wed, Aug 13, 2014 at 12:59 PM, Gilles Dubuc gilles@wikimedia.org wrote:
The context is that Filippo from Ops would like to run a regular cleanup job that deletes thumbnails from swift that have non-mediawiki-requested sizes, when they haven't been accessed for X amount of time. Currently we keep all thumbnails forever.
The idea is that 3rd-party tools requesting odd sizes would end up using less storage space, since the thumbnails they request would be deleted after a while. This would be accompanied by documentation for developers, indicating that the best performance is obtained by using the predefined set of sizes currently in use by the various tools in production (core, extensions, mobile apps and sites, etc.).
This is an interim solution while we still store thumbnails on swift, which in itself is something we want to change in the future.
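As a rough illustration of the policy described above (not Filippo's actual job), the deletion decision could be sketched in Python like this. The set of standard widths, the 30-day threshold and the function name are all placeholders made up for the example; the real list and the value of X are still to be decided, and where the last-access time would come from is left out of scope.

```python
from datetime import datetime, timedelta

# Placeholder values: hypothetical, NOT the real list of standard sizes.
STANDARD_WIDTHS = {320, 640, 800, 1024, 1280}
MAX_IDLE = timedelta(days=30)  # stands in for "X amount of time"


def should_delete(width, last_accessed, now,
                  standard_widths=STANDARD_WIDTHS, max_idle=MAX_IDLE):
    """Return True if a thumbnail has a non-standard width and has been idle too long."""
    if width in standard_widths:
        return False  # standard sizes are kept regardless of access time
    return now - last_accessed > max_idle


now = datetime(2014, 8, 13)
print(should_delete(469, datetime(2014, 1, 1), now))   # odd size, idle for months
print(should_delete(1024, datetime(2014, 1, 1), now))  # standard size, always kept
```

Keeping the standard widths as an explicit set makes it easy to extend the list as new "official" sizes turn up.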
- we want to use less storage space
Yes
- images we are generating and caching for not-Wikipedia should be the
first to go
Yes. More accurately, images we are currently generating for unknown 3rd parties requesting unusual sizes.
- we assume weird sizes are from not-Wikipedia. So let's cache them for
less time
Either they are coming from unknown 3rd parties, or from defunct code. And yes, the idea is to keep them in swift for a period, instead of keeping them in swift forever.
- except, that doesn't work, because of tall images
For anything below 1280px width, we can't differentiate requests coming from core's file page for tall images from requests for odd sizes. Above that, it's a lot easier to tell the difference between code we run and 3rd parties, which means that we're probably already going to see some significant storage savings. In fact, Filippo has given me figures from production; I just have to compile them to know how much storage we're talking about. I'll do that soon and it will be a good opportunity to see how much we're "missing out" due to the <1280 tall images case.
- so maybe we should change the image request format?
If the thumbnail url format allowed constraining by height in addition to width, we could keep the existing file page behavior and differentiate "ours vs theirs" thumbnail requests for sizes below 1280px. It would be a lot of work, so we have to see if it's worth it.
- If you want to prioritize Wiki[mp]edia thumbnails, why not use the
referrer header instead? Why use the width parameter to detect this?
Referrer is unreliable in the real world. Browsers can suppress it, so can proxies, etc. The width parameter doesn't tell us the source. If we receive a request for "469" width, we can't tell if it's coming from a 3rd party or a visitor of the file page for an image which is for example 469px wide and 1024px tall.
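To make the ambiguity concrete, here is a small Python sketch of how a height-bound size on the file page turns into an odd-looking width in the thumbnail url. The function name and the 1876x4096 original are hypothetical, and the rounding may not match exactly what MediaWiki does; it is only meant to show where a width like 469 can come from.

```python
def width_for_height(orig_width, orig_height, target_height):
    """Width that results from scaling an image to target_height, keeping aspect ratio."""
    return round(orig_width * target_height / orig_height)


# Hypothetical tall original, 1876x4096, offered at a 1024px height on the file page.
w = width_for_height(1876, 4096, 1024)
print(w)  # 469
```

From the width alone, a request for 469px produced this way is indistinguishable from a 3rd party asking for an arbitrary 469px thumbnail.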
- Are we sure we'll improve overall performance by evicting certain files
from cache quicker? Why not trust the LRU cache algorithm?
Performance, no, but storage space, yes. The idea is that the performance impact would be limited to clients requesting weird image sizes. I don't think we have an LRU option to speak of; it would be a job written by Ops.
- as maintainers of the wikimedia media file servers, we want to reduce
the number of images cached in order to save storage space and cost?
Yes, and in particular this would allow us to use the existing capacity for more useful purposes, such as pre-generating all expected thumbnail sizes at upload time. This means that "official" clients, or clients sticking to the extensive list of sizes we'll support, will never hit a thumbnail size that needs to be generated on the fly.
is it possible to cache based on a last accessed timestamp?
When we move away from swift, this is exactly what we want to set up. Although it would be interesting to contemplate making exceptions for widely used sizes. What I'm describing is a temporary solution while we still live in the thumbnails-on-swift status quo.
- if an image size has not been accessed within x number of days purge it
from the cache
Basically this is an attempt to do this on swift, while not touching sizes that we know are requested by a lot of clients.
On Wed, Aug 13, 2014 at 12:35 PM, dan-nl dan.entous.wikimedia@gmail.com wrote:
what is the main use case?
- as maintainers of the wikimedia media file servers, we want to reduce
the number of images cached in order to save storage space and cost?
- and/or something else?
is it possible to cache based on a last accessed timestamp?
- if an image size has not been accessed within x number of days purge it
from the cache
with kind regards, dan
On Aug 13, 2014, at 11:18 , Neil Kandalgaonkar neilk@neilk.net wrote:
I think I need more context. Is this what you're saying?
- we want to use less storage space
- images we are generating and caching for not-Wikipedia should be the
first to go
- we assume weird sizes are from not-Wikipedia. So let's cache them for
less time
- except, that doesn't work, because of tall images
- so maybe we should change the image request format?
If this is accurate I have a few questions:
- If you want to prioritize Wiki[mp]edia thumbnails, why not use the
referrer header instead? Why use the width parameter to detect this?
- Are we sure we'll improve overall performance by evicting certain
files from cache quicker? Why not trust the LRU cache algorithm?
On 8/13/14, 1:36 AM, Gilles Dubuc wrote:
Currently the file page provides a set of different image sizes for
the user to directly access. These sizes are usually width-based. However, for tall images they are height-based. The thumbnail urls, which are used to generate them, pass only a width.
What this means is that tall images end up with arbitrary thumbnail
widths that don't follow the set of sizes meant for the file page. The end result from an ops perspective is that we end up with very diverse widths for thumbnails. Not a problem in itself, but the exposure of these random-ish widths on the file page means that we can't set a different caching policy for non-standard widths without affecting the images linked from the file page.
I see two solutions to this problem, if we want to introduce different
caching tiers for thumbnail sizes that come from mediawiki and thumbnail sizes that were requested by other things.
The first one would be to always keep the size progression on the file
page width-bound, even for soft-rotated images. The first drawback of this is that for very skinny/very wide images the file size progression between the sizes could become steep. The second drawback is that we'd often offer fewer size options, because they'd be based on the smallest dimension.
The second option would be to change the syntax of the thumbnail urls
in order to allow height constraint. This is a pretty scary change.
If we don't do anything, it simply means that we'll have to apply the
same caching policy to every size smaller than 1280. We could already save quite a bit of storage space by evicting non-standard sizes larger than that, but sizes lower than 1280 would have to stay the way they are now.
Thoughts?
Multimedia mailing list
Multimedia@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/multimedia
-- Neil Kandalgaonkar (| neilk@neilk.net