Two days ago the disk filled up on one of our servers, Bacon, (http://ganglia.wikimedia.org/pmtpa/graph.php?c=Miscellaneous&h=bacon.wik...).
The full disk resulted in some thumbnails failing to render.
The root problem was resolved, but some of the failed thumbnails remained failed. They could be resolved by purging the image page, or by simply waiting for the cache to expire for them. The technical team considered the matter closed.
Sometime today awareness of broken thumbs on English Wikipedia rocketed up.
Rather than successfully flagging the tech team's attention, a series of inaccurate sitenotices were placed on English Wikipedia and on several other language Wikipedias. The English notice in particular was displayed to the general public.
The notices claimed that the issue was being worked on. This was not correct. The notice most likely caused people to not report the problems they were seeing.
None of the active tech team were aware of any ongoing issue. It was understood that some images would fail to display until their cache expired but this was not believed to be an issue significant enough in scale to justify any action.
When I happened to browse over to enwp as a reader I saw the notice. I asked ST47 to remove the notice. I got a hold of our resident caching god, Mark Bergsma, and went ahead and mass-purged all the thumbnails.
Sometime after that point the incorrect notice was restored on English Wikipedia and revised several times, and in its last version it attempted to give bad directions on how to purge images. It is generally inadvisable to instruct the general public to purge pages on a wide scale for a number of reasons.
All in all, this issue was handled poorly on every side. On the tech side a status report should have gone out after the fix, and on the Wikipedia admins' side no claim should ever be made that a problem is being worked on unless you are darn sure that it is the case.
There are also some issues related to how we communicate with the public, but I'll leave it to someone else to complain about that.
My biggest fear is that, had there been a second issue, it might have persisted for days with the techs unaware of the problem. I've seen earlier examples in our user communities of overeagerness to claim that something is being worked on. It frightens me for this reason.
Hopefully future events will be handled better and this message will increase awareness of the potential issues involved.
Thanks for your time.
On 9/16/07, Gregory Maxwell gmaxwell@gmail.com wrote: [snip]
I asked ST47 to remove the notice. I got a hold of our resident caching god, Mark Bergsma, and went ahead and mass-purged all the thumbnails.
[snip]
I intended to state "and he went ahead and mass-purged all the thumbnails which were believed to be affected by the issue".
My excuse for this failure to proofread is that I've been spoiled by the ability to revise my own comments on the wikis. Yeah... spoiled... that's the ticket.
I think it would be useful in these cases of major failure, for someone Official, to post to this list about the problem and when users can expect it to be fixed. On Commons several users came to complain about the problem. I contemplated writing to this list but first I went to the IRC room. The topic had a note about images, and there was already the "[[Water]] images broken" thread on this list so I assumed that Official people were aware of the problem and working on it (or not, as their priorities dictated :)).
But on this mailing list there is not a single post from a dev acknowledging the problem. If people think that changing the topic in the IRC room is enough, in terms of communication, I have to respectfully disagree.
So how is the communication chain between masses of users and a handful of developers supposed to run?
regards, Brianna
On 17/09/2007, Gregory Maxwell gmaxwell@gmail.com wrote:
Brianna Laugher (is that your real name? :-) ) wrote:
...there was already the "[[Water]] images broken" thread on this list so I assumed that Official people were aware of the problem and working on it (or not, as their priorities dictated :)).
The devs are usually pretty responsive, both to this list and (especially) the irc channel. Maybe they were all just away for the weekend, or something?
Steve Summit wrote:
Brianna Laugher (is that your real name? :-) ) wrote:
...there was already the "[[Water]] images broken" thread on this list so I assumed that Official people were aware of the problem and working on it (or not, as their priorities dictated :)).
The devs are usually pretty responsive, both to this list and (especially) the irc channel. Maybe they were all just away for the weekend, or something?
I'm busy moving back to Australia. I thought other people could probably handle this. I'll be largely unavailable until 19 September, approx 02:00 UTC.
I'm not aware of any evidence that the problem was fixed by switching bacon off, so it might be still ongoing. This needs to be monitored closely.
And in another post:
find -size 0 -print | xargs rm
It's a bit more complicated than that, since we're not using the images stored on bacon anymore. However they may be stored in the squid cluster. Something like...
find * -size 0 -print | perl -ne 'print "http://upload.wikimedia.org/$_\n"' | php purgeList.php
That may work, or at least rule out the current hypothesis.
-- Tim Starling
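For illustration, the idea in the command above (list the zero-byte thumbnails, turn each path into its public URL, and feed the list to purgeList.php) could be sketched as a small shell function. The function name and its two arguments are assumptions made for this example, not the actual layout on the servers, and the paths are not URL-encoded, so file names with spaces or special characters would need extra handling.

```shell
# Sketch only: print one URL per zero-byte file found under a directory.
# list_empty_thumb_urls, thumb_root and base_url are hypothetical names.
list_empty_thumb_urls() {
    thumb_root=$1   # directory holding the rendered thumbnails (assumed)
    base_url=$2     # URL prefix the files are served under (assumed)

    # Run find from inside the root so the printed paths are relative.
    # -type f skips directories; -size 0 matches empty files only.
    ( cd "$thumb_root" && find . -type f -size 0 -print ) |
        # Drop the leading "./" and prepend the base URL.
        # Assumes base_url contains no "|" or "&" characters.
        sed -e 's|^\./||' -e "s|^|$base_url/|"
}
```

The output could then be piped on in the same way as in the command above, for example `list_empty_thumb_urls /path/to/thumbs http://upload.wikimedia.org | php purgeList.php`, where `/path/to/thumbs` is a placeholder.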
Tim Starling wrote:
The devs are usually pretty responsive, both to this list and (especially) the irc channel. Maybe they were all just away for the weekend, or something?
I'm busy moving back to Australia. I thought other people could probably handle this. I'll be largely unavailable until 19 September, approx 02:00 UTC.
I'm not aware of any evidence that the problem was fixed by switching bacon off, so it might be still ongoing. This needs to be monitored closely.
Ok, have a nice trip, I'll keep an eye on it. I can investigate it more closely when I'm free in 3-4 hours.
And in another post:
find -size 0 -print | xargs rm
It's a bit more complicated than that, since we're not using the images stored on bacon anymore. However they may be stored in the squid cluster. Something like...
find * -size 0 -print | perl -ne 'print "http://upload.wikimedia.org/$_\n"' | php purgeList.php
That may work, or at least rule out the current hypothesis.
I ran a similar job last night. However, that doesn't seem to totally fix it. Also, users are reporting that even after thumbs are purged and reappear, they often break again after a while.
I wrote:
I'm not aware of any evidence that the problem was fixed by switching bacon off, so it might be still ongoing. This needs to be monitored closely.
Turns out bacon wasn't switched off at all. It is now, and I'm running a purge script again (Mark ran one yesterday).
-- Tim Starling
Steve Summit wrote:
Brianna Laugher (is that your real name? :-) ) wrote:
...there was already the "[[Water]] images broken" thread on this list so I assumed that Official people were aware of the problem and working on it (or not, as their priorities dictated :)).
The devs are usually pretty responsive, both to this list and (especially) the irc channel. Maybe they were all just away for the weekend, or something?
Believe it or not, some of us actually aren't in the office on weekends, and if nobody bothers to contact us, we might not notice something until Monday. :)
-- brion vibber (brion @ wikimedia.org)
Brion wrote:
Steve Summit wrote:
The devs are usually pretty responsive, both to this list and (especially) the irc channel. Maybe they were all just away for the weekend, or something?
Believe it or not, some of us actually aren't in the office on weekends,
Fair enough!
and if nobody bothers to contact us, we might not notice something until Monday. :)
So the question of the day is, what's the right way, and who should do it? In the past I've been told that Wikitech-l was not necessarily immediate and that #wikimedia-tech (IRC) was better. So what's the next escalation step?
On 9/17/07, Steve Summit scs@eskimo.com wrote:
So the question of the day is, what's the right way, and who should do it? In the past I've been told that Wikitech-l was not necessarily immediate and that #wikimedia-tech (IRC) was better. So what's the next escalation step?
You must first get as far as that escalation stage before going to the next. For the most part, ongoing issues were not being reported to the tech channel in this case.
While the number of people who can fix important problems is fairly small, there is a larger number of people who have their phone numbers and a few brave souls who are foolish enough to not fear using them! ;)
In all seriousness, I've not personally witnessed any recent cases where the right people or a reasonable stand-in couldn't be reached quickly. That isn't where the breakdown is...
On Mon, Sep 17, 2007 at 10:38:09AM -0400, Brion Vibber wrote:
Believe it or not, some of us actually aren't in the office on weekends,
Office? You work in an office?
You got demoted? ;-)
Cheers -- jra
Steve Summit wrote:
Brianna Laugher (is that your real name? :-) ) wrote:
...there was already the "[[Water]] images broken" thread on this list so I assumed that Official people were aware of the problem and working on it (or not, as their priorities dictated :)).
The devs are usually pretty responsive, both to this list and (especially) the irc channel. Maybe they were all just away for the weekend, or something?
Or maybe ... uhh ... busy fixing the problem? :)
-- Daniel Cannon (AmiDaniel)
http://amidaniel.com cannon.danielc@gmail.com
On 9/20/07, Daniel Cannon cannon.danielc@gmail.com wrote:
The devs are usually pretty responsive, both to this list and (especially) the irc channel. Maybe they were all just away for the weekend, or something?
Or maybe ... uhh ... busy fixing the problem? :)
Nope.
On 20/09/2007, Daniel Cannon cannon.danielc@gmail.com wrote:
Or maybe ... uhh ... busy fixing the problem? :)
Apparently not, since they didn't know the problem existed, as has been previously established in this thread.
Rob Church
On 9/16/07, Brianna Laugher brianna.laugher@gmail.com wrote: [snip]
So how is the communication chain between masses of users and a handful of developers supposed to run?
Normally serious outages are followed up by a message to wikitech-l.
Seems things didn't work as well as usual in this case. This seems like an opportunity to learn and improve a little.
On Mon, Sep 17, 2007 at 01:51:45PM +1000, Brianna Laugher wrote:
So how is the communication chain between masses of users and a handful of developers supposed to run?
Try finding a phone number to call eBay about a problem.
Go ahead. I dare you. :-)
When you have a billion and six users and, what, 15, 20 developers, most part time, the problem is one which can often be fixed only by making those communication channels informal and undocumented.
If you *have* your Geek License, you'll know where to go, and you'll be a good enough problem reporter, in general, to be listened to once you get there.
Have you ever spent any time reading Tier 1 problem reports, Brianna? :-)
It would take us *another* 20 people to set up a formal problem reporting system.
No, in this case, I agree with (was it) Geoff: the problem wasn't the disk fill, nor the busted thumbnails, nor even the purge, it was the site notice, which, as he says, shouldn't have gone up until it was confirmed that someone was actually working on it.
Cheers, -- jra
On Mon, Sep 17, 2007 at 01:07:56PM -0400, Jay R. Ashworth wrote:
No, in this case, I agree with (was it) Geoff: the problem wasn't the
Nope; it was Greg. Sorry, Greg; one of mutt's few weaknesses.
Cheers, -- jra
wikitech-l@lists.wikimedia.org