Just thought I'd float this idea for comments before I try working on it...
Between multi-megapixel digital photographs and other wacky multimedia fun, uploads are taking up an ever-huger amount of disk space, bandwidth, etc. Our existing primary image fileserver is a bit sluggish; a new one with a nice big drive array is on order but we still would like to provide for better local and downstream caching.
It would make caching much easier if the file at a given URL was immutable; that is, if a replacement image has a different URL from the old one.
For an example of the problem with mutable images, take this scenario:
1) A featured article has a photo, say, [[Image:Puppy.jpg]]
2) Somebody uploads goatse.cx on top of it.
3) A visitor comes, and fetches the goatse image at http://upload.wikimedia.org/wikipedia/en/a/a1/Puppy.jpg — his ISP's transparent proxy caches the image.
4) An admin reverts the image back to the puppy and protects it.
5) Another visitor loads the article, and fetches the puppy image at http://upload.wikimedia.org/wikipedia/en/a/a1/Puppy.jpg — he's from the same ISP, and the proxy returns the previously loaded goatse image.
6) The visitor e-mails the Wikimedia board to complain about their *very offensive* web site. ;)
One possibility is to embed the timestamp into the URL. So the goatse version might be: http://upload.wikimedia.org/wikipedia/en/2005/10/23/074223/Puppy.jpg
and the reverted image would get a different URL, a few minutes later: http://upload.wikimedia.org/wikipedia/en/2005/10/23/074506/Puppy.jpg
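Roughly, such a timestamp-keyed URL could be derived like this (a Python sketch; the function name and path layout are just illustrations of the examples above, not anything implemented):

```python
from datetime import datetime

UPLOAD_BASE = "http://upload.wikimedia.org/wikipedia"

def immutable_url(site, filename, uploaded):
    """Build a timestamp-based immutable URL for one image revision.

    Every upload gets a fresh path derived from its upload time, so a
    replacement image never reuses the previous URL.
    """
    ts = uploaded.strftime("%Y/%m/%d/%H%M%S")
    return "%s/%s/%s/%s" % (UPLOAD_BASE, site, ts, filename)

print(immutable_url("en", "Puppy.jpg", datetime(2005, 10, 23, 7, 42, 23)))
# http://upload.wikimedia.org/wikipedia/en/2005/10/23/074223/Puppy.jpg
```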
(The article pages need to be rerendered with the new link, but this is already necessary to accommodate changes in size, etc. Articles are forced to be rechecked from end-clients and are only cached by proxies we control and send explicit purges to, so that 'should' stay under control.)
This scheme would allow for outside proxy caches to cache a given image file indefinitely without it becoming dangerously stale, as well as more permanent on-demand replicated image servers to distribute bandwidth across our clusters without stealing squid cache space from articles.
A downside is that image URLs aren't predictable ahead of time; unless you're in the database to check what the latest version of the image is, you can't build the URL from just a file name. One could, though, concoct a little special page or something to redirect to whatever the current version is.
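That redirecting special page could be as simple as a lookup plus a 302. A sketch (the dict stands in for the image table; names and the URL layout are hypothetical):

```python
# Toy lookup table standing in for the image table; in practice this
# would be a database query for the latest revision's timestamp path.
latest_version = {
    "Puppy.jpg": "2005/10/23/074506",
}

def redirect_target(site, filename):
    """Resolve a plain file name to its current immutable URL, or None.

    A special page could answer e.g. /Special:ImageRedirect/Puppy.jpg
    with a 302 Location header pointing at this URL, so bots and
    external links never need to know the timestamp.
    """
    path = latest_version.get(filename)
    if path is None:
        return None
    return "http://upload.wikimedia.org/wikipedia/%s/%s/%s" % (site, path, filename)
```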
Another benefit is that not having the "/a/ad/" cache directory will allow people with badly written ad blockers to see the missing 1/256th of our images again. ;)
-- brion vibber (brion @ pobox.com)
It would make caching much easier if the file at a given URL was immutable; that is, if a replacement image has a different URL from the old one.
I thought I had suggested this a long time ago already. :)
In my opinion, you should redesign the image/oldimage table structure in the same way that cur/old was changed into page/revision. Each image revision, whether "current" or not, should have a numerical ID for a primary key. You can then use that integer in your unique URLs. (I am against the idea of timestamps because they are not guaranteed to be unique. ;-) ) This has the added bonus that (a) when an image gets overwritten by a new version, the old image's URL stays the same; (b) reverting to an older image revision allows you to just revert to the old URL instead of creating a new one, thereby somewhat helping caching. You could even use MD5/SHA1/whatever checksums to detect duplicate uploads and re-use the already-existing image revision with its already-existing URL.
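A minimal sketch of that proposal, assuming an auto-incrementing revision ID and SHA-1 dedup (the class, the in-memory dict, and the URL layout are all stand-ins for the real tables):

```python
import hashlib

class ImageStore:
    """Sketch of the proposal: every image revision gets a numeric
    primary key, and identical uploads are detected by content hash so
    they reuse the existing revision (and thus the existing URL)."""

    def __init__(self):
        self._next_id = 1
        self._by_hash = {}   # sha1 hex digest -> revision id

    def upload(self, data):
        digest = hashlib.sha1(data).hexdigest()
        if digest in self._by_hash:          # duplicate: reuse old revision
            return self._by_hash[digest]
        rev_id = self._next_id
        self._next_id += 1
        self._by_hash[digest] = rev_id
        return rev_id

def revision_url(site, rev_id, filename):
    # Illustrative URL layout keyed on the revision ID, not a timestamp.
    return "http://upload.wikimedia.org/wikipedia/%s/%d/%s" % (site, rev_id, filename)
```

Reverting then just means pointing the "current" row back at an old revision ID, so the old URL comes back verbatim and downstream caches that still hold it stay warm.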
LiveJournal did the same originally with their user picture URLs. I realise they ran into the concern that it would allow people to enumerate all user pictures, making it easy for even a good-faith script to hammer the servers. As a counter-measure, they also added the numerical ID of the user the image belongs to into the URL. Since that never changes either, you still have the nice 'permanent URLs' effect.
So we will have to think about whether we have the same concern. If we do, we can add the numerical user ID of the image's uploader to its URL, or (as we do already) part of a checksum. We currently checksum the image's filename, however, which I strongly object to, because it makes it harder to code an image-renaming feature. If we're going to redesign this anyway, we might as well make it so that this feature will be easier to code in the future.
Note also that the most commonly requested image-related feature is the ability to undelete an image, thereby removing the last bit of irreversibility in an admin's toolset.
Timwi
On 10/23/05, Timwi timwi@gmx.net wrote:
LiveJournal did the same originally with their user picture URLs. I realise they ran into the concern that it would allow people to enumerate all user pictures, making it easy for even a good-faith script to hammer the servers. As a counter-measure, they also added the numerical ID of the user the image belongs to into the URL. Since that never changes either, you still have the nice 'permanent URLs' effect.
We also hardcoded the image-serving httpd to immediately respond with a 304 Not Modified if the request had an If-Modified-Since header, which made a huge difference in traffic. (But LJ has a lot more "reload" traffic than WP does, I'd expect.)
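The trick works because an immutable URL means any copy the client already has is by definition still current, so the conditional request can be answered without even touching the filesystem. A toy handler showing the idea (the headers and body are illustrative, not LJ's actual configuration):

```python
from http.server import BaseHTTPRequestHandler

class ImmutableImageHandler(BaseHTTPRequestHandler):
    """Since the content at a given URL never changes, any conditional
    request can be answered 304 Not Modified unconditionally."""

    def do_GET(self):
        if self.headers.get("If-Modified-Since"):
            self.send_response(304)   # client's cached copy is still good
            self.end_headers()
            return
        body = b"...image bytes..."
        self.send_response(200)
        self.send_header("Content-Type", "image/jpeg")
        # Immutable content can be cached essentially forever.
        self.send_header("Cache-Control", "public, max-age=31536000")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)
```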
One advantage of using a hash (instead of a timestamp or an autoincrement secondary key) as a guid is that multiple uploads of the same image can be stored as one instance. I guess it depends on whether duplicates are a concern; for example, how often is the one goatse image uploaded?
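The hash-as-GUID variant could look like this (a sketch under the assumption that the digest of the image bytes goes straight into the path; the layout is hypothetical):

```python
import hashlib

def content_url(site, filename, data):
    """Hash-addressed variant: the URL is keyed on a digest of the
    image bytes, so re-uploading identical bytes yields the same URL
    and the file need only be stored once."""
    digest = hashlib.sha1(data).hexdigest()
    return "http://upload.wikimedia.org/wikipedia/%s/%s/%s" % (site, digest, filename)
```

Identical uploads collapse to one URL automatically, while any change to the bytes produces a new one.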
Brion Vibber wrote:
One possibility is to embed the timestamp into the URL. So the goatse version might be: http://upload.wikimedia.org/wikipedia/en/2005/10/23/074223/Puppy.jpg
and the reverted image would get a different URL, a few minutes later: http://upload.wikimedia.org/wikipedia/en/2005/10/23/074506/Puppy.jpg
That sounds very sensible. It wouldn't slow down page rendering at all. We'd better implement that redirecting special page sooner rather than later, because you know the bot writers will never ask for it. They'll just code up ugly hacks like parsing the image description page. We'll find out 6 months later when we change the skin and break all the bots.
Since you're redesigning, would now be a good time to implement an archive for deleted images?
-- Tim Starling
Brion Vibber wrote:
Just thought I'd float this idea for comments before I try working on it...
It would make caching much easier if the file at a given URL was immutable; that is, if a replacement image has a different URL from the old one.
[snip]
This is a good idea. Cache-friendliness is really important; done right, it can greatly reduce load and increase performance without affecting the freshness of content for end-users. (See my remarks from last week.)
-- Neil