Just thought I'd float this idea for comments before I try working on it...
Between multi-megapixel digital photographs and other wacky multimedia fun, uploads are taking up an ever-huger amount of disk space, bandwidth, etc. Our existing primary image fileserver is a bit sluggish; a new one with a nice big drive array is on order but we still would like to provide for better local and downstream caching.
It would make caching much easier if the file at a given URL was immutable; that is, if a replacement image has a different URL from the old one.
For an example of the problem with mutable images, take this scenario:
1) A featured article has a photo, say, [[Image:Puppy.jpg]]
2) Somebody uploads goatse.cx on top of it.
3) A visitor comes, and fetches the goatse image at http://upload.wikimedia.org/wikipedia/en/a/a1/Puppy.jpg — his ISP's transparent proxy caches the image.
4) An admin reverts the image back to the puppy and protects it.
5) Another visitor loads the article, and fetches the puppy image at http://upload.wikimedia.org/wikipedia/en/a/a1/Puppy.jpg — he's from the same ISP, and the proxy returns the previously loaded goatse image.
6) The visitor e-mails the Wikimedia board to complain about their *very offensive* web site. ;)
One possibility is to embed the timestamp into the URL. So the goatse version might be: http://upload.wikimedia.org/wikipedia/en/2005/10/23/074223/Puppy.jpg
and the reverted image would get a different URL, a few minutes later: http://upload.wikimedia.org/wikipedia/en/2005/10/23/074506/Puppy.jpg
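Roughly, such a timestamp-keyed URL could be derived like this (a Python sketch; the function name and path layout are just illustrations of the examples above, not anything implemented):

```python
from datetime import datetime

UPLOAD_BASE = "http://upload.wikimedia.org/wikipedia"

def immutable_url(site, filename, uploaded):
    """Build a timestamp-based immutable URL for one image revision.

    Every upload gets a fresh path derived from its upload time, so a
    replacement image never reuses the previous URL.
    """
    ts = uploaded.strftime("%Y/%m/%d/%H%M%S")
    return "%s/%s/%s/%s" % (UPLOAD_BASE, site, ts, filename)

print(immutable_url("en", "Puppy.jpg", datetime(2005, 10, 23, 7, 42, 23)))
# http://upload.wikimedia.org/wikipedia/en/2005/10/23/074223/Puppy.jpg
```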
(The article pages need to be rerendered with the new link, but this is already necessary to accommodate changes in size, etc. Articles are forced to be rechecked from end-clients and are only cached by proxies we control and send explicit purges to, so that 'should' stay under control.)
This scheme would allow for outside proxy caches to cache a given image file indefinitely without it becoming dangerously stale, as well as more permanent on-demand replicated image servers to distribute bandwidth across our clusters without stealing squid cache space from articles.
A downside is that image URLs aren't predictable ahead of time; unless you're in the database to check what the latest version of the image is, you can't build the URL from just a file name. One could, though, concoct a little special page or something to redirect to whatever the current version is.
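That redirecting special page could be as simple as a lookup plus a 302. A sketch (the dict stands in for the image table; names and the URL layout are hypothetical):

```python
# Toy lookup table standing in for the image table; in practice this
# would be a database query for the latest revision's timestamp path.
latest_version = {
    "Puppy.jpg": "2005/10/23/074506",
}

def redirect_target(site, filename):
    """Resolve a plain file name to its current immutable URL, or None.

    A special page could answer e.g. /Special:ImageRedirect/Puppy.jpg
    with a 302 Location header pointing at this URL, so bots and
    external links never need to know the timestamp.
    """
    path = latest_version.get(filename)
    if path is None:
        return None
    return "http://upload.wikimedia.org/wikipedia/%s/%s/%s" % (site, path, filename)
```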
Another benefit is that not having the "/a/ad/" cache directory will allow people with badly written ad blockers to see the missing 1/256th of our images again. ;)
-- brion vibber (brion @ pobox.com)
It would make caching much easier if the file at a given URL was immutable; that is, if a replacement image has a different URL from the old one.
I thought I had suggested this a long time ago already. :)
In my opinion, you should redesign the image/oldimage table structure in the same way that cur/old was changed into page/revision. Each image revision, whether "current" or not, should have a numerical ID for a primary key. You can then use that integer in your unique URLs. (I am against the idea of timestamps because they are not guaranteed to be unique. ;-) ) This has the added bonus that (a) when an image gets overwritten by a new version, the old image's URL stays the same; (b) reverting to an older image revision allows you to just revert to the old URL instead of creating a new one, thereby somewhat helping caching. You could even use MD5/SHA1/whatever checksums to detect duplicate uploads and re-use the already-existing image revision with its already-existing URL.
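A minimal sketch of that proposal, assuming an auto-incrementing revision ID and SHA-1 dedup (the class, the in-memory dict, and the URL layout are all stand-ins for the real tables):

```python
import hashlib

class ImageStore:
    """Sketch of the proposal: every image revision gets a numeric
    primary key, and identical uploads are detected by content hash so
    they reuse the existing revision (and thus the existing URL)."""

    def __init__(self):
        self._next_id = 1
        self._by_hash = {}   # sha1 hex digest -> revision id

    def upload(self, data):
        digest = hashlib.sha1(data).hexdigest()
        if digest in self._by_hash:          # duplicate: reuse old revision
            return self._by_hash[digest]
        rev_id = self._next_id
        self._next_id += 1
        self._by_hash[digest] = rev_id
        return rev_id

def revision_url(site, rev_id, filename):
    # Illustrative URL layout keyed on the revision ID, not a timestamp.
    return "http://upload.wikimedia.org/wikipedia/%s/%d/%s" % (site, rev_id, filename)
```

Reverting then just means pointing the "current" row back at an old revision ID, so the old URL comes back verbatim and downstream caches that still hold it stay warm.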
LiveJournal did the same originally with their user picture URLs. I realise they ran into the concern that it would allow people to enumerate all user pictures, making it easy for even a good-faith script to hammer the servers. As a counter-measure, they also added the numerical ID of the user the image belongs to into the URL. Since that never changes either, you still have the nice 'permanent URLs' effect.
So we will have to think about whether we have the same concern. If we do, we can add the numerical user ID of the image's uploader to its URL, or (as we do already) part of a checksum. We currently checksum the image's filename, however, which I strongly object to, because it makes it harder to code an image-renaming feature. If we're going to redesign this anyway, we might as well make it so that this feature will be easier to code in the future.
Note also that the most commonly requested image-related feature is the ability to undelete an image, thereby removing the last bit of irreversibility in an admin's toolset.
Timwi
On 10/23/05, Timwi timwi@gmx.net wrote:
LiveJournal did the same originally with their user picture URLs. I realise they ran into the concern that it would allow people to enumerate all user pictures, making it easy for even a good-faith script to hammer the servers. As a counter-measure, they also added the numerical ID of the user the image belongs to into the URL. Since that never changes either, you still have the nice 'permanent URLs' effect.
We also hardcoded the image-serving httpd to immediately respond with a 304 Not Modified if the request had an If-Modified-Since header, which made a huge difference in traffic. (But LJ has a lot more "reload" traffic than WP does, I'd expect.)
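The trick works because an immutable URL means any copy the client already has is by definition still current, so the conditional request can be answered without even touching the filesystem. A toy handler showing the idea (the headers and body are illustrative, not LJ's actual configuration):

```python
from http.server import BaseHTTPRequestHandler

class ImmutableImageHandler(BaseHTTPRequestHandler):
    """Since the content at a given URL never changes, any conditional
    request can be answered 304 Not Modified unconditionally."""

    def do_GET(self):
        if self.headers.get("If-Modified-Since"):
            self.send_response(304)   # client's cached copy is still good
            self.end_headers()
            return
        body = b"...image bytes..."
        self.send_response(200)
        self.send_header("Content-Type", "image/jpeg")
        # Immutable content can be cached essentially forever.
        self.send_header("Cache-Control", "public, max-age=31536000")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)
```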
One advantage of using a hash (instead of a timestamp or an autoincrement secondary key) as a guid is that multiple uploads of the same image can be stored as one instance. I guess it depends on whether duplicates are a concern; for example, how often is the one goatse image uploaded?
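The hash-as-GUID variant could look like this (a sketch under the assumption that the digest of the image bytes goes straight into the path; the layout is hypothetical):

```python
import hashlib

def content_url(site, filename, data):
    """Hash-addressed variant: the URL is keyed on a digest of the
    image bytes, so re-uploading identical bytes yields the same URL
    and the file need only be stored once."""
    digest = hashlib.sha1(data).hexdigest()
    return "http://upload.wikimedia.org/wikipedia/%s/%s/%s" % (site, digest, filename)
```

Identical uploads collapse to one URL automatically, while any change to the bytes produces a new one.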
Brion Vibber wrote:
One possibility is to embed the timestamp into the URL. So the goatse version might be: http://upload.wikimedia.org/wikipedia/en/2005/10/23/074223/Puppy.jpg
and the reverted image would get a different URL, a few minutes later: http://upload.wikimedia.org/wikipedia/en/2005/10/23/074506/Puppy.jpg
That sounds very sensible. It wouldn't slow down page rendering at all. We'd better implement that redirecting special page sooner rather than later, because you know the bot writers will never ask for it. They'll just code up ugly hacks like parsing the image description page. We'll find out 6 months later when we change the skin and break all the bots.
Since you're redesigning, would now be a good time to implement an archive for deleted images?
-- Tim Starling
Brion Vibber wrote:
Just thought I'd float this idea for comments before I try working on it...
It would make caching much easier if the file at a given URL was immutable; that is, if a replacement image has a different URL from the old one.
[snip]
This is a good idea. Cache-friendliness is really important; done right, it can greatly reduce load and increase performance without affecting the freshness of content for end-users. (See my remarks from last week.)
-- Neil