Recent thumbnail problems and problem reporting.

Gregory Maxwell

16 Sep 2007

9:54 p.m.

Two days ago the disk filled up on one of our servers, Bacon (http://ganglia.wikimedia.org/pmtpa/graph.php?c=Miscellaneous&h=bacon.wik...).

The full disk resulted in some thumbnails failing to render.

The root problem was resolved, but some of the failed thumbnails remained failed. They could be resolved by purging the image page, or by simply waiting for the cache to expire for them. The technical team considered the matter closed.

Sometime today awareness of broken thumbs on English Wikipedia rocketed up.

Rather than successfully flagging the tech team's attention, a series of inaccurate sitenotices were placed on English Wikipedia and on several other language Wikipedias. The English notice in particular was displayed to the general public.

The notices claimed that the issue was being worked on. This was not correct. The notice most likely caused people to not report the problems they were seeing.

None of the active tech team were aware of any ongoing issue. It was understood that some images would fail to display until their cache expired but this was not believed to be an issue significant enough in scale to justify any action.

When I happened to browse over to enwp as a reader I saw the notice. I asked ST47 to remove the notice. I got a hold of our resident caching god, Mark Bergsma, and went ahead and mass-purged all the thumbnails.

Sometime after that point the incorrect notice was restored on English Wikipedia and revised several times, and in its last version it gave bad directions on how to purge images. It is generally inadvisable to instruct the general public to purge pages on a wide scale for a number of reasons.

All in all, this issue was handled poorly on every side. On the tech side, a status report should have gone out after the fix, and on the Wikipedia admins' side, no claim should ever be made that a problem is being worked on unless you are darn sure that it is the case.

There are also some issues related to how we communicate with the public, but I'll leave it to someone else to complain about that.

My biggest fear is that, had there been a second issue, it might have persisted for days with the techs unaware of the problem. I've seen prior examples in our user communities of overeagerness to claim that something is being worked on. It frightens me for this reason.

Hopefully future events will be handled better and this message will increase awareness of the potential issues involved.

Thanks for your time.

Gregory Maxwell

16 Sep

10:08 p.m.

On 9/16/07, Gregory Maxwell gmaxwell@gmail.com wrote: [snip]

...

I asked ST47 to remove the notice. I got a hold of our resident caching god, Mark Bergsma, and went ahead and mass-purged all the thumbnails.

[snip]

I intended to state "and he went ahead and mass-purged all the thumbnails which were believed to be affected by the issue".

My excuse for this failure to proofread is that I've been spoiled by the ability to revise my own comments on the wikis. Yeah.. spoiled.. that's the ticket.

Brianna Laugher

11:51 p.m.

I think it would be useful in these cases of major failure, for someone Official, to post to this list about the problem and when users can expect it to be fixed. On Commons several users came to complain about the problem. I contemplated writing to this list but first I went to the IRC room. The topic had a note about images, and there was already the "[[Water]] images broken" thread on this list so I assumed that Official people were aware of the problem and working on it (or not, as their priorities dictated :)).

But on this mailing list there is not a single post from a dev acknowledging the problem. If people think that changing the topic in the IRC room is enough, in terms of communication, I have to respectfully disagree.

So how is the communication chain between masses of users and a handful of developers supposed to run?

regards, Brianna

Steve Summit

17 Sep

12:03 a.m.

Brianna Laugher (is that your real name? :-) ) wrote:

...

...there was already the "[[Water]] images broken" thread on this list so I assumed that Official people were aware of the problem and working on it (or not, as their priorities dictated :)).

The devs are usually pretty responsive, both to this list and (especially) the irc channel. Maybe they were all just away for the weekend, or something?

Tim Starling

6:55 a.m.

Steve Summit wrote:

...

Brianna Laugher (is that your real name? :-) ) wrote:

...
...there was already the "[[Water]] images broken" thread on this list so I assumed that Official people were aware of the problem and working on it (or not, as their priorities dictated :)).

The devs are usually pretty responsive, both to this list and (especially) the irc channel. Maybe they were all just away for the weekend, or something?

I'm busy moving back to Australia. I thought other people could probably handle this. I'll be largely unavailable until 19 September, approx 02:00 UTC.

I'm not aware of any evidence that the problem was fixed by switching bacon off, so it might be still ongoing. This needs to be monitored closely.

And in another post:

...

find -size 0 -print | xargs rm

It's a bit more complicated than that, since we're not using the images stored on bacon anymore. However they may be stored in the squid cluster. Something like...

find * -size 0 -print | perl -ne 'print "http://upload.wikimedia.org/$_\n"' | php purgeList.php

That may work, or at least rule out the current hypothesis.
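
For anyone who wants to experiment with the same idea away from the production hosts, here is a minimal Python sketch of the first two stages of that pipeline: walk a directory tree, pick out the zero-byte files, and print one URL per file. It is an illustration only, not the script that was actually run. The function name `zero_byte_urls` is hypothetical, and the sketch assumes that paths relative to the starting directory map directly onto paths under http://upload.wikimedia.org/, as the one-liner implies. Its output would still have to be piped into purgeList.php, as above.

```python
import os
import sys
from urllib.parse import quote

# Host taken from the one-liner above; that relative paths map directly
# onto URL paths under it is an assumption of this sketch.
BASE_URL = "http://upload.wikimedia.org/"


def zero_byte_urls(root, base_url=BASE_URL):
    """Yield one URL for each empty regular file found under root.

    Hypothetical helper: roughly what `find -size 0` plus the perl
    step in the pipeline above produce.
    """
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            # Only regular files whose size is exactly zero bytes.
            if os.path.isfile(path) and os.path.getsize(path) == 0:
                rel = os.path.relpath(path, root).replace(os.sep, "/")
                # Percent-encode the path but keep "/" separators.
                yield base_url + quote(rel)


if __name__ == "__main__":
    start = sys.argv[1] if len(sys.argv) > 1 else "."
    for url in zero_byte_urls(start):
        print(url)
```

Unlike the perl step, this percent-encodes file names, which may or may not match what the purge script expects; treat that as a design choice to verify rather than a known requirement.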

-- Tim Starling

Mark Bergsma

7:42 a.m.

Tim Starling wrote:

...

...
The devs are usually pretty responsive, both to this list and (especially) the irc channel. Maybe they were all just away for the weekend, or something?

I'm busy moving back to Australia. I thought other people could probably handle this. I'll be largely unavailable until 19 September, approx 02:00 UTC.

I'm not aware of any evidence that the problem was fixed by switching bacon off, so it might be still ongoing. This needs to be monitored closely.

Ok, have a nice trip, I'll keep an eye on it. I can investigate it more closely when I'm free in 3-4 hours.

...

And in another post:

...
find -size 0 -print | xargs rm

It's a bit more complicated than that, since we're not using the images stored on bacon anymore. However they may be stored in the squid cluster. Something like...

find * -size 0 -print | perl -ne 'print "http://upload.wikimedia.org/$_\n"' | php purgeList.php

That may work, or at least rule out the current hypothesis.

I ran a similar job last night. However, that doesn't seem to have fixed it completely. Also, users are reporting that even after thumbs are purged and reappear, they often break again after a while.

-- Mark Bergsma mark@wikimedia.org System & Network Administrator, Wikimedia Foundation

Tim Starling

7:44 a.m.

I wrote:

...

I'm not aware of any evidence that the problem was fixed by switching bacon off, so it might be still ongoing. This needs to be monitored closely.

Turns out bacon wasn't switched off at all. It is now, and I'm running a purge script again (Mark ran one yesterday).

-- Tim Starling

Brion Vibber

10:38 a.m.

Steve Summit wrote:

...

Brianna Laugher (is that your real name? :-) ) wrote:

...
...there was already the "[[Water]] images broken" thread on this list so I assumed that Official people were aware of the problem and working on it (or not, as their priorities dictated :)).

The devs are usually pretty responsive, both to this list and (especially) the irc channel. Maybe they were all just away for the weekend, or something?

Believe it or not, some of us actually aren't in the office on weekends, and if nobody bothers to contact us, we might not notice something until Monday. :)

-- brion vibber (brion @ wikimedia.org)

Steve Summit

11:37 a.m.

Brion wrote:

...

Steve Summit wrote:

...
The devs are usually pretty responsive, both to this list and (especially) the irc channel. Maybe they were all just away for the weekend, or something?

Believe it or not, some of us actually aren't in the office on weekends,

Fair enough!

...

and if nobody bothers to contact us, we might not notice something until Monday. :)

So the question of the day is, what's the right way, and who should do it? In the past I've been told that Wikitech-l was not necessarily immediate and that #wikimedia-tech (IRC) was better. So what's the next escalation step?

Gregory Maxwell

11:54 a.m.

On 9/17/07, Steve Summit scs@eskimo.com wrote:

...

So the question of the day is, what's the right way, and who should do it? In the past I've been told that Wikitech-l was not necessarily immediate and that #wikimedia-tech (IRC) was better. So what's the next escalation step?

You must first get as far as that escalation stage before going on to the next. For the most part, ongoing issues were not being reported to the tech channel in this case.

While the number of people who can fix important problems is fairly small there is a larger number of people who have their phone-numbers and a few brave souls who are foolish enough to not fear using them! ;)

In all seriousness, I've not personally witnessed any recent cases where the right people or a reasonable stand-in couldn't be reached quickly. That isn't where the breakdown is...

Jay R. Ashworth

1:09 p.m.

On Mon, Sep 17, 2007 at 10:38:09AM -0400, Brion Vibber wrote:

...

Believe it or not, some of us actually aren't in the office on weekends,

Office? You work in an office?

You got demoted? ;-)

Cheers -- jra

-- Jay R. Ashworth Baylink jra@baylink.com Designer The Things I Think RFC 2100 Ashworth & Associates http://baylink.pitas.com '87 e24 St Petersburg FL USA http://photo.imageinc.us +1 727 647 1274

Daniel Cannon

20 Sep

4:53 p.m.

Steve Summit wrote:

...

Brianna Laugher (is that your real name? :-) ) wrote:

...
...there was already the "[[Water]] images broken" thread on this list so I assumed that Official people were aware of the problem and working on it (or not, as their priorities dictated :)).

The devs are usually pretty responsive, both to this list and (especially) the irc channel. Maybe they were all just away for the weekend, or something?

Or maybe ... uhh ... busy fixing the problem? :)

-- Daniel Cannon (AmiDaniel)

http://amidaniel.com cannon.danielc@gmail.com

Gregory Maxwell

5:12 p.m.

On 9/20/07, Daniel Cannon cannon.danielc@gmail.com wrote:

...

...
The devs are usually pretty responsive, both to this list and (especially) the irc channel. Maybe they were all just away for the weekend, or something?

Or maybe ... uhh ... busy fixing the problem? :)

Nope.

Rob Church

8:12 p.m.

On 20/09/2007, Daniel Cannon cannon.danielc@gmail.com wrote:

...

Or maybe ... uhh ... busy fixing the problem? :)

Apparently not, since they didn't know the problem existed, as has been previously established in this thread.

Rob Church

Gregory Maxwell

17 Sep

12:20 a.m.

On 9/16/07, Brianna Laugher brianna.laugher@gmail.com wrote: [snip]

...

So how is the communication chain between masses of users and a handful of developers supposed to run?

Normally serious outages are followed up by a message to wikitech-l.

Seems things didn't work as well as usual in this case. This seems like an opportunity to learn and improve a little.

Jay R. Ashworth

1:07 p.m.

On Mon, Sep 17, 2007 at 01:51:45PM +1000, Brianna Laugher wrote:

...

So how is the communication chain between masses of users and a handful of developers supposed to run?

Try finding a phone number to call eBay about a problem.

Go ahead. I dare you. :-)

When you have a billion and six users and, what, 15, 20 developers, most part time, the problem is one which can often be fixed only by making those communication channels informal and undocumented.

If you *have* your Geek License, you'll know where to go, and you'll be a good enough problem reporter, in general, to be listened to once you get there.

Have you ever spent any time reading Tier 1 problem reports, Brianna? :-)

It would take us *another* 20 people to set up a formal problem reporting system.

No, in this case, I agree with (was it) Geoff: the problem wasn't the disk fill, nor the busted thumbnails, nor even the purge, it was the site notice, which, as he says, shouldn't have gone up until it was confirmed that someone was actually working on it.

Cheers, -- jra

Jay R. Ashworth

1:09 p.m.

On Mon, Sep 17, 2007 at 01:07:56PM -0400, Jay R. Ashworth wrote:

...

No, in this case, I agree with (was it) Geoff: the problem wasn't the

Nope; it was Greg. Sorry, Greg; one of mutt's few weaknesses.

Cheers, -- jra

wikitech-l@lists.wikimedia.org