To revive this old thread...
On Sep 5, 2012, at 9:35 PM, Asher Feldman afeldman@wikimedia.org wrote:
On Tue, Sep 4, 2012 at 3:11 PM, Platonides Platonides@gmail.com wrote:
On 03/09/12 02:59, Tim Starling wrote:
I'll go for option 4. You can't delete the images from the backend while they are still in Squid, because then they would not be purged when the image is updated or action=purge is requested. In fact, that is one of only two reasons for the existence of the backend thumbnail store on Wikimedia. The thumbnail backend could be replaced by a text file that stores a list of thumbnail filenames which were sent to Squid within a window equivalent to the expiry time sent in the Cache-Control header. -- Tim Starling
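To make the "list of thumbnail filenames" idea concrete, here is a minimal in-memory sketch. It assumes entries are recorded when a thumbnail is sent to the cache and dropped once they are older than the Cache-Control expiry time. The class and method names are hypothetical, and a real version would persist to a file as described above.

```python
from collections import deque


class SentThumbLog:
    """Remembers which thumbnails were sent to the cache within the expiry window."""

    def __init__(self, max_age_seconds):
        # Assumed to equal the max-age sent in the Cache-Control header.
        self.max_age = max_age_seconds
        self._entries = deque()  # (sent_at, original, thumb), oldest first

    def record(self, original, thumb, now):
        """Note that `thumb` (derived from `original`) was sent at time `now`."""
        self._entries.append((now, original, thumb))

    def expire(self, now):
        """Drop entries the cache must already have expired on its own."""
        while self._entries and now - self._entries[0][0] >= self.max_age:
            self._entries.popleft()

    def thumbs_to_purge(self, original, now):
        """Return the thumbnails of `original` that may still be cached."""
        self.expire(now)
        return sorted({t for _, o, t in self._entries if o == original})
```

When an image is updated, `thumbs_to_purge()` gives the set of URLs that still need an explicit PURGE; anything older than the window needs none.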
The second one seems easy to fix. The first one should IMHO be fixed in Squid/Varnish by allowing wildcard purges (i.e. PURGE /wikipedia/commons/thumb/5/5c/Tim_starling.jpg/* HTTP/1.0).
fast.ly implements group purge for Varnish like this via a proxy daemon that watches backend responses for a "tag" response header (i.e. all resolutions of Tim_starling.jpg would carry the same tag) and builds an in-memory hash of tags->objects which can be purged on. I've been told they'd probably open source the code for us if we want it, and it is interesting (especially as a way to deal with the fact that we don't purge articles at all of their possible URLs), albeit with its own challenges. If we implemented a backend system to track the thumbnails that exist for a given original, we may be able to remove our dependency on Swift container listings to purge images, paving the way for a second class of thumbnails that are only cached.
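As an illustration of the tags->objects hash described above, here is a rough sketch. This is not fast.ly's code; the header name "X-Purge-Tag" and the class are made up for the example.

```python
from collections import defaultdict


class TagIndex:
    """In-memory map from a tag to the cached URLs that carry it."""

    TAG_HEADER = "X-Purge-Tag"  # hypothetical header name

    def __init__(self):
        self._by_tag = defaultdict(set)

    def observe(self, url, response_headers):
        """Called for each backend response passing through the proxy."""
        tag = response_headers.get(self.TAG_HEADER)
        if tag:
            self._by_tag[tag].add(url)

    def purge(self, tag):
        """Forget the tag and return every URL that has to be purged for it."""
        return sorted(self._by_tag.pop(tag, set()))
```

A purge for the tag "Tim_starling.jpg" would then expand into one PURGE per URL returned, whatever resolutions happened to be requested.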
How about this idea:
Just "purge all images with this prefix" doesn't really work in Squid or Varnish, because they don't store their cache database in a format that makes it cheap to determine which objects would match that. Varnish could do it with its "bans", but each ban is kept around for a long time, and with the tens, sometimes hundreds of purges a second we do, this would quickly add up to a massive ban list.
But... Varnish allows you to customize how it hashes objects into its object hash table (vcl_hash). What we could do is hash thumbnails to the same hash key as their original. Because of our current URL structure, that's pretty much a matter of stripping off the thumbnail suffix. Then the original and all its associated thumbnails end up at the same hash key in the hash table, and a single purge for the original would nuke them all out of the cache.
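To show what I mean, here is a toy model of the idea in Python rather than VCL. It assumes thumbnail URLs look like /wikipedia/commons/thumb/5/5c/Tim_starling.jpg/120px-Tim_starling.jpg and that the original lives at the same path without the "thumb" segment and the last component. The regex, hash_key() and ToyCache are illustrative only, not how Varnish is implemented.

```python
import re
from collections import defaultdict

# Assumed layout: /<project>/<site>/thumb/<x>/<xy>/<name>/<thumbnail name>
THUMB_RE = re.compile(
    r"^(/[^/]+/[^/]+)/thumb(/[0-9a-f]/[0-9a-f]{2}/[^/]+)/[^/]+$"
)


def hash_key(url_path):
    """Map a thumbnail URL to its original's URL; leave other URLs alone."""
    m = THUMB_RE.match(url_path)
    if m:
        return m.group(1) + m.group(2)
    return url_path


class ToyCache:
    """Objects sharing a hash key sit in one bucket, as Vary variants do."""

    def __init__(self):
        self._buckets = defaultdict(dict)

    def store(self, url_path, body):
        self._buckets[hash_key(url_path)][url_path] = body

    def lookup(self, url_path):
        return self._buckets.get(hash_key(url_path), {}).get(url_path)

    def purge(self, url_path):
        """Drop the whole bucket; return how many objects were removed."""
        return len(self._buckets.pop(hash_key(url_path), {}))
```

Purging /wikipedia/commons/5/5c/Tim_starling.jpg in this model removes the original and every thumbnail stored under it in one operation, while lookups still distinguish the individual sizes.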
This relies on Varnish having an efficient implementation for multiple objects at a single hash key. It probably does, since it implements Vary processing this way. We would essentially be doing the same, Vary-ing on the thumbnail size. But I'll check the implementation to be sure.
Of course this won't work for Squid, but I'm pretty close to being able to replace Squid with Varnish entirely for upload.