Hi everyone,
Short version: Some people have reported corrupted thumbnail images
that appear truncated. We’ve identified the
cause of the problem and believe we have a fix in place. It may take
a few days to fully propagate, during which you may continue to see
corrupted thumbnails. Manually purging the image page should repair
any broken thumbnail.
Longer version: Last week, we enabled Swift as a replacement for NFS
for storing images (see blog post [1]). Our goal with this project is
to replace a single point of failure, and increase fault tolerance and
capacity.
Recently, we discovered that images were sometimes getting corrupted.
After further investigation, Tim Starling and Ralf Schmitt both
independently figured out that whenever a client disconnected early
while fetching a thumbnail, the server would write out a partial
thumbnail to our Swift cluster and to the cache. Ben ran the numbers,
and estimated that roughly 1.6% of the thumbnails were corrupted, and
that roughly 4.5% of images had at least one corrupt thumbnail.
We've disabled Swift for the time being, going back to our old way of
serving thumbnails. Unfortunately, even though thumbnails are no
longer coming from our Swift cluster, corrupt thumbnails remain in
our Squid cache, which we can't easily purge without creating a large
performance problem. Disabling Swift therefore only stops the problem
from getting worse rather than fixing it. Thankfully, we're reasonably confident
there won't be any new broken images.
Aaron Schulz came up with a pretty simple fix. We were writing
thumbnails to Swift while streaming them to the client, so when the
client disconnected, the write into Swift was cut short as well.
We have added an MD5 checksum of the generated image to the ETag
header when pushing it to Swift. Swift accepts the file for writing,
and if the MD5 checksum doesn’t match after the connection to Swift
closes, the partial thumbnail is deleted.
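To make the checksum idea concrete, here is a minimal sketch in Python. It is not the actual change (see the links below for that). FakeSwiftContainer and upload_thumbnail are hypothetical names, and the 201/422 return values only imitate the kind of status an object store might report. The point is just that a checksum computed over the complete thumbnail cannot match a body that was cut short.

```python
import hashlib


class FakeSwiftContainer:
    """Toy stand-in for an object store (hypothetical, not the real Swift API)."""

    def __init__(self):
        self.objects = {}

    def put(self, name, received_bytes, etag):
        # Compare the MD5 of what actually arrived with the checksum
        # the sender promised in the ETag header.
        if hashlib.md5(received_bytes).hexdigest() != etag:
            self.objects.pop(name, None)  # do not keep a partial object
            return 422
        self.objects[name] = received_bytes
        return 201


def upload_thumbnail(container, name, thumbnail, bytes_sent=None):
    """Upload a thumbnail; bytes_sent simulates an early disconnect."""
    # The checksum is computed over the complete generated image.
    etag = hashlib.md5(thumbnail).hexdigest()
    body = thumbnail if bytes_sent is None else thumbnail[:bytes_sent]
    return container.put(name, body, etag)


if __name__ == "__main__":
    container = FakeSwiftContainer()
    thumb = b"\xff\xd8" + b"x" * 1000 + b"\xff\xd9"
    print(upload_thumbnail(container, "full.jpg", thumb))
    print(upload_thumbnail(container, "cut.jpg", thumb, bytes_sent=100))
```

In this sketch the complete upload is stored, while the upload that stops after 100 bytes is rejected and nothing is kept under that name.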
The fix consists of two small changes. One is in the thumbnail generation:
http://www.mediawiki.org/wiki/Special:Code/MediaWiki/111517
...and one in the process that writes images to Swift:
https://gerrit.wikimedia.org/r/#change,2598
We aren’t planning to deploy this right away. What we want to do
instead is repair the damage first. Ben will write a script that
crawls through Swift, searches for corrupt images, nukes them from
Swift and purges them from our Squid cache via HTCP. Once that’s
complete, we’ll re-enable the new improved thumbnail pipeline
with Swift, which *should* no longer keep partial images.
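For the curious, the cleanup pass could look roughly like the Python sketch below. This is not Ben's script, which has not been written yet. Everything here is an assumption made for illustration: the store is a plain dict, purge is a caller-supplied callback standing in for the HTCP purge, and the corruption check is a simple heuristic that looks for the JPEG end-of-image marker or the PNG IEND chunk at the end of the file.

```python
JPEG_SOI = b"\xff\xd8"              # JPEG start-of-image marker
JPEG_EOI = b"\xff\xd9"              # JPEG end-of-image marker
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
PNG_TRAILER = b"IEND\xaeB`\x82"     # IEND chunk type plus its fixed CRC


def looks_truncated(data):
    """Heuristic only: a complete file should end with its format's trailer."""
    if data.startswith(JPEG_SOI):
        return not data.endswith(JPEG_EOI)
    if data.startswith(PNG_SIGNATURE):
        return not data.endswith(PNG_TRAILER)
    return False  # unknown format: leave it alone


def repair(store, purge):
    """Delete truncated thumbnails from store and purge each one from cache.

    store is a dict of name -> bytes (assumed stand-in for Swift);
    purge is a callback taking the object name (assumed stand-in for HTCP).
    """
    removed = []
    for name in list(store):
        if looks_truncated(store[name]):
            del store[name]
            purge(name)
            removed.append(name)
    return removed


if __name__ == "__main__":
    good = JPEG_SOI + b"data" + JPEG_EOI
    store = {"good.jpg": good, "bad.jpg": good[:4]}
    print(repair(store, purge=lambda name: print("purge", name)))
```

A trailer check like this can miss some damage and would flag valid files that carry extra bytes after the trailer, so a real script would likely need a stronger test.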
Sorry for any problems this might have caused.
Rob
[1] Ben’s announcement of our Swift deployment
http://blog.wikimedia.org/2012/02/09/scaling-media-storage-at-wikimedia-wit…