Hi everyone,
Short version: Some people have reported corrupted thumbnail images that appear truncated. We’ve identified the cause of the problem and believe we have a fix in place. It may take a few days to fully propagate, during which you may continue to see corrupted thumbnails. Manually purging the image page should repair any broken thumbnail.
Longer version: Last week, we enabled Swift as a replacement for NFS for storing images (see blog post [1]). Our goal with this project is to replace a single point of failure, and increase fault tolerance and capacity.
Recently, we discovered that images were sometimes getting corrupted. After further investigation, Tim Starling and Ralf Schmitt both independently figured out that whenever a client disconnected early while fetching a thumbnail, the server would write out a partial thumbnail to our Swift cluster and to the cache. Ben ran the numbers, and estimated that roughly 1.6% of the thumbnails were corrupted, and that roughly 4.5% of images had at least one corrupt thumbnail.
We've disabled Swift for the time being and gone back to our old way of serving thumbnails. Unfortunately, even though thumbnails are no longer coming from our Swift cluster, there will still be corrupt thumbnails in our Squid cache, which we can't easily purge wholesale (without creating a large performance problem). Disabling Swift therefore only stops the problem from getting worse rather than fixing it. Thankfully, we're reasonably confident there won't be any new broken images.
Aaron Schulz came up with a pretty simple fix. We were writing thumbnails to Swift while streaming them to the client, so when the client disconnected, the process writing the file into Swift stopped as well, leaving a partial file behind. We now add an MD5 checksum of the generated image to the ETag header when pushing it to Swift. Swift accepts the file for writing, and if the MD5 checksum doesn’t match after the connection to Swift closes, the partial thumbnail is deleted.
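To illustrate the checksum idea described above, here is a minimal Python sketch. It is not the actual patch: it makes no calls to Swift, and the function names (etag_for, store_accepts) are hypothetical. It only shows that an MD5 taken over the complete generated thumbnail will not match the MD5 of a truncated upload, which is what lets the partial file be detected and thrown away.

```python
import hashlib


def etag_for(thumbnail: bytes) -> str:
    """MD5 hex digest of the complete thumbnail, to be sent as the ETag header."""
    return hashlib.md5(thumbnail).hexdigest()


def store_accepts(received: bytes, claimed_etag: str) -> bool:
    """Hypothetical stand-in for the storage-side check.

    The object is kept only if the bytes that actually arrived hash to
    the ETag the uploader claimed.
    """
    return hashlib.md5(received).hexdigest() == claimed_etag


# A fake "thumbnail"; the contents are arbitrary bytes for illustration.
full = b"\xff\xd8" + b"thumbnail-bytes" * 100 + b"\xff\xd9"
etag = etag_for(full)

# Complete upload: the checksum matches, so the object is kept.
complete_ok = store_accepts(full, etag)

# Client disconnected halfway: only part of the body arrived, so the
# checksum does not match and the partial object is rejected.
truncated_ok = store_accepts(full[: len(full) // 2], etag)
```

The key point is that the checksum has to be computed over the whole generated image, independently of how much of it was actually streamed out.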
The changes were two small fixes. One is in the thumbnail generation:
http://www.mediawiki.org/wiki/Special:Code/MediaWiki/111517
...and one is in the process that writes images to Swift:
https://gerrit.wikimedia.org/r/#change,2598
We aren’t planning to deploy this right away. What we want to do instead is repair the damage first. Ben will write a script that crawls through Swift, searches for corrupt images, deletes them from Swift and purges them from our Squid cache via HTCP. Once that’s complete, we’ll re-enable the new, improved thumbnail pipeline with Swift, which *should* no longer keep partial images.
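As a rough idea of what the detection step of such a cleanup could look like, here is a small Python sketch. This is not Ben's script, and the detection method is an assumption on my part: it treats a JPEG without its end-of-image marker, or a PNG without its final IEND chunk, as truncated. The names looks_truncated and find_corrupt are hypothetical, and the Swift listing, Swift delete and HTCP purge steps are deliberately left out.

```python
# File-format markers (these come from the JPEG and PNG formats themselves).
JPEG_SOI = b"\xff\xd8"                               # JPEG start-of-image
JPEG_EOI = b"\xff\xd9"                               # JPEG end-of-image
PNG_SIG = b"\x89PNG\r\n\x1a\n"                       # PNG file signature
PNG_IEND = b"\x00\x00\x00\x00IEND\xae\x42\x60\x82"   # final PNG chunk incl. CRC


def looks_truncated(data: bytes) -> bool:
    """Heuristic (an assumption): a file is truncated if its end marker is missing."""
    if data.startswith(JPEG_SOI):
        return not data.endswith(JPEG_EOI)
    if data.startswith(PNG_SIG):
        return not data.endswith(PNG_IEND)
    return False  # unknown format: leave it alone rather than guess


def find_corrupt(objects: dict) -> list:
    """Return the names of objects that look truncated, in sorted order.

    `objects` maps an object name to its bytes; in a real script these
    would be fetched from storage, which is not shown here.
    """
    return sorted(name for name, data in objects.items() if looks_truncated(data))


# Made-up sample data for illustration only.
good_jpeg = JPEG_SOI + b"pixels" * 10 + JPEG_EOI
good_png = PNG_SIG + b"chunks" * 10 + PNG_IEND
sample = {
    "a/120px-ok.jpg": good_jpeg,
    "a/240px-cut.jpg": good_jpeg[:20],
    "b/120px-ok.png": good_png,
    "b/240px-cut.png": good_png[:30],
}
to_purge = find_corrupt(sample)
```

Each name in to_purge would then be deleted from storage and purged from the cache. The heuristic can misfire on valid files with trailing bytes after the end marker, so a real script would probably want a stricter check.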
Sorry for any problems this might have caused.
Rob
[1] Ben’s announcement of our Swift deployment http://blog.wikimedia.org/2012/02/09/scaling-media-storage-at-wikimedia-with...