Two days ago the disk filled up on one of our servers, Bacon, (http://ganglia.wikimedia.org/pmtpa/graph.php?c=Miscellaneous&h=bacon.wik...).
The full disk resulted in some thumbnails failing to render.
The root problem was resolved, but some of the failed thumbnails stayed broken. They could be fixed by purging the image page, or by simply waiting for their cache entries to expire. The technical team considered the matter closed.
Sometime today awareness of broken thumbs on English Wikipedia rocketed up.
Rather than successfully flagging the tech team's attention, a series of inaccurate sitenotices were placed on English Wikipedia and on several other language Wikipedias. The English notice in particular was displayed to the general public.
The notices claimed that the issue was being worked on. This was not correct. The notice most likely caused people to not report the problems they were seeing.
None of the active tech team were aware of any ongoing issue. It was understood that some images would fail to display until their cache expired but this was not believed to be an issue significant enough in scale to justify any action.
When I happened to browse over to enwp as a reader I saw the notice. I asked ST47 to remove the notice. I got a hold of our resident caching god, Mark Bergsma, and went ahead and mass-purged all the thumbnails.
Sometime after that point the incorrect notice was restored on English Wikipedia and revised several times, and in its last version it attempted to give bad directions on how to purge images. It is generally inadvisable to instruct the general public to purge pages on a wide scale for a number of reasons.
All in all, this issue was handled poorly on both sides. On the tech side, a status report should have gone out after the fix; on the Wikipedia admins' side, no claim should ever be made that a problem is being worked on unless you are darn sure that it is the case.
There are also some issues related to how we communicate with the public, but I'll leave it to someone else to complain about that.
My biggest fear is that, had there been a second issue, it might have persisted for days with the techs unaware of the problem. I've seen some prior examples of over eagerness to claim something is being worked on in the past in our user communities. It frightens me for this reason.
Hopefully future events will be handled better and this message will increase awareness of the potential issues involved.
Thanks for your time.
On 9/16/07, Gregory Maxwell gmaxwell@gmail.com wrote: [snip]
I asked ST47 to remove the notice. I got a hold of our resident caching god, Mark Bergsma, and went ahead and mass-purged all the thumbnails.
[snip]
I intended to state "and he went ahead and mass-purged all the thumbnails which were believed to be affected by the issue".
My excuse for this failure to proofread is that I've been spoiled by the ability to revise my own comments on the wikis. Yeah... spoiled... that's the ticket.
On 17/09/2007, Gregory Maxwell gmaxwell@gmail.com wrote: *snip*
I've seen some prior examples of over eagerness to claim something is being worked on in the past in our user communities.
*snip*
On IRC when Wikipedia goes down, I always set the channel entry message to say that our "Technical Response Group" is working to fix the problem, because with something as serious as a Wikipedia downtime, the techs generally are already upon it. Is that wrong?
~Mark Ryan
On 9/16/07, Mark Ryan ultrablue@gmail.com wrote:
On 17/09/2007, Gregory Maxwell gmaxwell@gmail.com wrote: *snip*
I've seen some prior examples of over eagerness to claim something is being worked on in the past in our user communities.
*snip*
On IRC when Wikipedia goes down, I always set the channel entry message to say that our "Technical Response Group" is working to fix the problem, because with something as serious as a Wikipedia downtime, the techs generally are already upon it. Is that wrong?
In cases of serious issues, if you do not have direct personal knowledge that someone with shell access (http://meta.wikimedia.org/wiki/Developers) is working on the issue, or is at least acutely aware of it, please do not make the claim that it is being worked on. Allow those who have direct knowledge to make the claim.
Since you mention IRC... you are welcome to join #wikimedia-tech. Please listen for a moment before asking. And be aware that technical banter between folks doesn't mean the right people are aware of the issue. Many problems can only be addressed by people on the sysadmin end of the spectrum, and there are a large number of people, including some MediaWiki developers, who are not sysadmins and cannot actually fix many problems even if they understand them and are talking about them. Do not assume that any person who knows more than you can fix the issue, will fix the issue, or will even bother to report it to someone who can.
In cases where the site is down, yes... Tech folks will know about it, but there is no harm in not making the statement unless you are sure.
In cases which are serious but are not a total site-down event, it is somewhat more likely that we've had some new and exciting mode of failure that the monitoring tools cannot yet catch. In these cases it is especially important that we do not prematurely suppress trouble reports.
In all cases over-reporting is preferable to under reporting. The tech IRC channel can be set moderated. Emails and OTRS messages can be filtered. And, of course, if you see one of the people listed with shell access saying "Hush we know already!" then it's a safe bet that the issue is actually being worked on. ;)
Also, if you do decide to contact any of the tech team yourself, please try to be detailed and constructive. Entering the tech IRC channel and saying "The darn site is broken AGAIN!" doesn't help fix anything. Instead say something like "When I load any page, like http://en.wikipedia.org/wiki/Foo, all the images are upside down. I'm running Firefox on Windows and this has been going on for two hours!".
On 9/17/07, Gregory Maxwell gmaxwell@gmail.com wrote:
In all cases over-reporting is preferable to under reporting.
In that spirit, the pages-meta-history dump broke, again. Someone please report this to someone who can fix it, and if you could have someone report back letting us know what the problem is and whether or not it'll ever be fixed, that'd be awesome.
Gregory Maxwell wrote:
Rather than successfully flagging the tech team's attention, a series of inaccurate sitenotices were placed on English Wikipedia and on several other language Wikipedias. The English notice in particular was displayed to the general public. [...] None of the active tech team were aware of any ongoing issue.
I don't understand this train of thought. If, as you say, the notice was displayed _to the general public_, how can the tech team remain unaware of it? Are they a bunch of robots sitting in a basement who act only upon direct command and who never browse Wikipedia as a member of the general public?
Presumably the main reason something was mentioned in the sitenotice but not to the tech team is that out of all active Wikipedia admins, a great majority (myself included) probably know how to put something in the sitenotice but not how to contact the "active tech team". If the sitenotice is the only course of action known to any particular admin, then that admin will naturally take that course of action (I know I would if I had been there).
Timwi