Two days ago the disk filled up on one of our servers, Bacon (http://ganglia.wikimedia.org/pmtpa/graph.php?c=Miscellaneous&h=bacon.wik...).
The full disk resulted in some thumbnails failing to render.
The root problem was resolved, but some of the failed thumbnails stayed broken. They could be fixed by purging the image page, or by simply waiting for their cache entries to expire. The technical team considered the matter closed.
Sometime today awareness of broken thumbs on English Wikipedia rocketed up.
Rather than the problem being brought to the tech team's attention, a series of inaccurate sitenotices was placed on English Wikipedia and on several other language Wikipedias. The English notice in particular was displayed to the general public.
The notices claimed that the issue was being worked on. This was not correct. The notice most likely discouraged people from reporting the problems they were seeing.
No one on the active tech team was aware of any ongoing issue. It was understood that some images would fail to display until their cache expired, but this was not believed to be significant enough in scale to justify any action.
When I happened to browse over to enwp as a reader, I saw the notice. I asked ST47 to remove it. I then got hold of our resident caching god, Mark Bergsma, and went ahead and mass-purged all the thumbnails.
Sometime after that point the incorrect notice was restored on English Wikipedia and revised several times; in its last version it gave bad directions on how to purge images. It is generally inadvisable to instruct the general public to purge pages on a wide scale, for a number of reasons.
All in all, this issue was handled poorly on every side. On the tech side, a status report should have gone out after the fix. On the Wikipedia admins' side, no claim should ever be made that a problem is being worked on unless you are darn sure that it is the case.
There are also some issues related to how we communicate with the public, but I'll leave it to someone else to complain about that.
My biggest fear is that, had there been a second issue, it might have persisted for days with the techs unaware of the problem. I've seen this over-eagerness to claim that something is being worked on before in our user communities, and that is why it frightens me.
Hopefully future events will be handled better and this message will increase awareness of the potential issues involved.
Thanks for your time.