https://en.wikipedia.org/w/index.php?title=Sharon_Aguilar&action=history
Request: GET http://en.wikipedia.org/w/index.php?title=Sharon_Aguilar&action=history, from 208.80.152.70 via sq66.wikimedia.org (squid/2.7.STABLE9) to () Error: ERR_CANNOT_FORWARD, errno (11) Resource temporarily unavailable at Sun, 27 Nov 2011 22:04:23 GMT
Request: GET http://en.wikipedia.org/w/index.php?title=Sharon_Aguilar&action=history, from 208.80.152.85 via sq66.wikimedia.org (squid/2.7.STABLE9) to () Error: ERR_CANNOT_FORWARD, errno (11) Resource temporarily unavailable at Sun, 27 Nov 2011 22:08:30 GMT
Request: GET http://en.wikipedia.org/w/index.php?title=Sharon_Aguilar&action=history, from 208.80.152.75 via sq66.wikimedia.org (squid/2.7.STABLE9) to () Error: ERR_CANNOT_FORWARD, errno (11) Resource temporarily unavailable at Sun, 27 Nov 2011 22:09:55 GMT
Request: GET http://en.wikipedia.org/w/index.php?title=Sharon_Aguilar&action=history, from 208.80.152.50 via sq66.wikimedia.org (squid/2.7.STABLE9) to () Error: ERR_CANNOT_FORWARD, errno (11) Resource temporarily unavailable at Sun, 27 Nov 2011 22:14:53 GMT
I am also getting this from en.m.wikipedia.org:
Error 503 Service Unavailable
Service Unavailable
Guru Meditation:
XID: 1592365530
Varnish cache server
On android, verizon 4g, default browser, 10+ minutes.
On Nov 27, 2011 5:18 PM, "William Allen Simpson" <william.allen.simpson@gmail.com> wrote:
https://en.wikipedia.org/w/index.php?title=Sharon_Aguilar&action=history
Request: GET http://en.wikipedia.org/w/index.php?title=Sharon_Aguilar&action=history, from 208.80.152.70 via sq66.wikimedia.org (squid/2.7.STABLE9) to () Error: ERR_CANNOT_FORWARD, errno (11) Resource temporarily unavailable at Sun, 27 Nov 2011 22:04:23 GMT
Request: GET http://en.wikipedia.org/w/index.php?title=Sharon_Aguilar&action=history, from 208.80.152.85 via sq66.wikimedia.org (squid/2.7.STABLE9) to () Error: ERR_CANNOT_FORWARD, errno (11) Resource temporarily unavailable at Sun, 27 Nov 2011 22:08:30 GMT
Request: GET http://en.wikipedia.org/w/index.php?title=Sharon_Aguilar&action=history, from 208.80.152.75 via sq66.wikimedia.org (squid/2.7.STABLE9) to () Error: ERR_CANNOT_FORWARD, errno (11) Resource temporarily unavailable at Sun, 27 Nov 2011 22:09:55 GMT
Request: GET http://en.wikipedia.org/w/index.php?title=Sharon_Aguilar&action=history, from 208.80.152.50 via sq66.wikimedia.org (squid/2.7.STABLE9) to () Error: ERR_CANNOT_FORWARD, errno (11) Resource temporarily unavailable at Sun, 27 Nov 2011 22:14:53 GMT
Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
We had a site outage of about 30 mins, caused by a major issue, potentially hardware-related, with a database server, which blocked all MediaWiki application servers (and thereby rendered most of our sites unusable). Should be fixed now; we'll prepare a more comprehensive incident analysis soon.
Thanks to the ops team for their speedy response.
All best, Erik
It appears that we were actually taken down by the reddit community, after a link to the fundraising stats page was posted under Brandon's IAMA there.
sq71.wikimedia.org 943326197 2011-11-27T22:51:09.075 62032 109.125.42.71 TCP_MISS/200 1035 GET http://wikimediafoundation.org/wiki/Special:FundraiserStatistics ANY_PARENT/ 208.80.152.47 text/html * http://www.reddit.com/r/IAmA/comments/mr4pf/i_am_wikipedia_programmer_brando... * - Mozilla/5.0%20(Windows%20NT%206.1;%20WOW64)%20AppleWebKit/535.2%20(KHTML,%20like%20Gecko)%20Chrome/15.0.874.121%20Safari/535.2
That page wasn't suitable for high volume public consumption (very expensive db query + not properly cached), so the site problem persisted even after the db initially suspected as bad was rotated out.
On Sun, Nov 27, 2011 at 2:39 PM, Erik Moeller erik@wikimedia.org wrote:
We had a site outage of about 30 mins, caused by a major issue, potentially hardware-related, with a database server, which blocked all MediaWiki application servers (and thereby rendered most of our sites unusable). Should be fixed now; we'll prepare a more comprehensive incident analysis soon.
Thanks to the ops team for their speedy response.
All best, Erik -- Erik Möller VP of Engineering and Product Development, Wikimedia Foundation
Support Free Knowledge: http://wikimediafoundation.org/wiki/Donate
On Mon, Nov 28, 2011 at 12:21 AM, Asher Feldman afeldman@wikimedia.org wrote:
That page wasn't suitable for high volume public consumption (very expensive db query + not properly cached), so the site problem persisted even after the db initially suspected as bad was rotated out.
What happened to it? When this page was introduced, it did have proper caching in memcached. Was that removed? Or did we get a cache stampede?
Roan
Roan Kattouw wrote:
On Mon, Nov 28, 2011 at 12:21 AM, Asher Feldman afeldman@wikimedia.org wrote:
That page wasn't suitable for high volume public consumption (very expensive db query + not properly cached), so the site problem persisted even after the db initially suspected as bad was rotated out.
What happened to it? When this page was introduced, it did have proper caching in memcached. Was that removed? Or did we get a cache stampede?
I asked roughly the same thing yesterday (more along the lines of "shouldn't it take someone ten minutes to add memcache support to the extension?"). Reedy said it was long-running queries that never timed out that apparently caused the issue.
The ContributionReporting extension being disabled is being tracked here: https://bugzilla.wikimedia.org/show_bug.cgi?id=32679.
MZMcBride
On Mon, Nov 28, 2011 at 3:50 PM, MZMcBride z@mzmcbride.com wrote:
I asked roughly the same thing yesterday (more along the lines of "shouldn't it take someone ten minutes to add memcache support to the extension?"). Reedy said it was long-running queries that never timed out that apparently caused the issue.
I read the code and sent some unsolicited advice to the fundraising team. Essentially, the recaching operation could be doing ~35 times fewer queries; that should help.
Roan
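Roan's point about reducing query count can be sketched as follows (the table name, columns, and per-day granularity are assumptions for illustration; the real ContributionReporting queries aren't shown in this thread): collapsing N per-day queries into a single GROUP BY query is typically where an ~N-fold reduction comes from.

```python
import sqlite3

# Hypothetical schema standing in for the fundraising contributions table;
# the actual ContributionReporting queries are not shown in the thread.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE contributions (received TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO contributions VALUES (?, ?)",
    [("2011-11-25", 10.0), ("2011-11-25", 5.0), ("2011-11-26", 20.0)],
)

# Naive recache: one round trip per fundraiser day (N queries total).
days = ["2011-11-25", "2011-11-26"]
per_day = {
    d: conn.execute(
        "SELECT COALESCE(SUM(amount), 0) FROM contributions WHERE received = ?",
        (d,),
    ).fetchone()[0]
    for d in days
}

# Batched recache: a single GROUP BY query returns every day's total at once.
batched = dict(
    conn.execute(
        "SELECT received, SUM(amount) FROM contributions GROUP BY received"
    ).fetchall()
)
```

Both approaches produce identical totals; the batched version just does it in one query instead of one per day.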
On 28/11/11 14:57, Roan Kattouw wrote:
On Mon, Nov 28, 2011 at 3:50 PM, MZMcBride z@mzmcbride.com wrote:
I asked roughly the same thing yesterday (more along the lines of "shouldn't it take someone ten minutes to add memcache support to the extension?"). Reedy said it was long-running queries that never timed out that apparently caused the issue.
I read the code and sent some unsolicited advice to the fundraising team. Essentially, the recaching operation could be doing ~35 times fewer queries; that should help.
Roan
And adding memcached caching with even, say, as little as a 1-minute cache entry timeout, should dilute that reduced load even more, and put an upper bound on the load generated, just in case it gets slashdotted/redditted again.
-- Neil
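Neil's point can be sketched in a few lines of Python (an in-process dict stands in for memcached; all names here are made up for illustration): with even a short TTL, the backend sees at most one rebuild per TTL window per key, no matter how large the traffic burst.

```python
import time

class ShortTTLCache:
    """Bound backend load: at most one rebuild per `ttl` seconds per key."""

    def __init__(self, ttl=60):
        self.ttl = ttl
        self.store = {}          # key -> (expires_at, value); stand-in for memcached
        self.backend_calls = 0   # instrumentation for this sketch

    def get(self, key, compute):
        now = time.monotonic()
        hit = self.store.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]        # cache hit: backend untouched
        self.backend_calls += 1
        value = compute()        # the expensive db query
        self.store[key] = (now + self.ttl, value)
        return value

cache = ShortTTLCache(ttl=60)
for _ in range(1000):            # a burst of reddit traffic within one minute
    result = cache.get("fundraiser-stats", lambda: "expensive db result")
```

Only the first request in the window reaches the backend; the other 999 are served from cache, which is exactly the upper bound on load Neil describes.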
On Mon, Nov 28, 2011 at 8:28 PM, Neil Harris neil@tonal.clara.co.uk wrote:
And adding memcached caching with even, say, as little as a 1 minute cache entry timeout, should dilute that reduced load even more, and put an upperbound on the load generated, just in case it gets slashdot/reddited again.
It was already in memcached, cached for 15 minutes. However, if recaching takes a long time and your page gets a lot of traffic, you can get a cache stampede (just like when Michael Jackson died): while the recache is in progress, there are more hits for your page and a zillion Apache workers all race to rebuild the cache, unaware of each other. I have no evidence that that's what happened, but that's my theory. Making the recache faster and/or upping the cache timeout reduces the size and the frequency, respectively, of the window in which this can happen.
The cache stampede problem was solved for the particular case of the parser cache using PoolCounter, but I don't think it's necessary for other types of caching. Computing fundraiser statistics simply shouldn't be that slow.
Roan
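The single-flight idea behind PoolCounter can be sketched roughly like this (a toy in-process version using a plain lock; the real PoolCounter is a separate network daemon, and the names here are hypothetical): concurrent cache misses collapse into one rebuild instead of a zillion racing workers.

```python
import threading

cache = {}
rebuild_lock = threading.Lock()
rebuilds = 0

def get_with_single_flight(key, compute):
    """While one worker recomputes, the rest wait and reuse its result."""
    global rebuilds
    if key in cache:                 # fast path: no locking on a warm cache
        return cache[key]
    with rebuild_lock:
        if key not in cache:         # re-check: another worker may have won
            rebuilds += 1
            cache[key] = compute()   # the slow recache happens exactly once
    return cache[key]

workers = [
    threading.Thread(target=get_with_single_flight,
                     args=("stats-page", lambda: "rendered stats"))
    for _ in range(50)
]
for w in workers:
    w.start()
for w in workers:
    w.join()
```

All 50 concurrent misses result in a single rebuild; without the lock and re-check, each worker that saw the empty cache would have run the expensive computation itself.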
On 28/11/11 19:36, Roan Kattouw wrote:
On Mon, Nov 28, 2011 at 8:28 PM, Neil Harris neil@tonal.clara.co.uk wrote:
And adding memcached caching with even, say, as little as a 1-minute cache entry timeout, should dilute that reduced load even more, and put an upper bound on the load generated, just in case it gets slashdotted/redditted again.
It was already in memcached, cached for 15 minutes. However, if recaching takes a long time and your page gets a lot of traffic, you can get a cache stampede (just like when Michael Jackson died): while the recache is in progress, there are more hits for your page and a zillion Apache workers all race to rebuild the cache, unaware of each other. I have no evidence that that's what happened, but that's my theory. Making the recache faster and/or upping the cache timeout reduces the size and the frequency, respectively, of the window in which this can happen.
The cache stampede problem was solved for the particular case of the parser cache using PoolCounter, but I don't think it's necessary for other types of caching. Computing fundraiser statistics simply shouldn't be that slow.
Roan
I hadn't thought properly about cache stampedes: since the parser cache is only part of page rendering, this might also explain some of the other occasional slowdowns I've seen on Wikipedia.
It would be really cool if there could be some sort of general mechanism to enable this to be prevented for all page URLs protected by memcaching, throughout the system.
-- N.
On Mon, Nov 28, 2011 at 8:59 PM, Neil Harris neil@tonal.clara.co.uk wrote:
I hadn't thought properly about cache stampedes: since the parser cache is only part of page rendering, this might also explain some of the other occasional slowdowns I've seen on Wikipedia.
It would be really cool if there could be some sort of general mechanism to enable this to be prevented for all page URLs protected by memcaching, throughout the system.
I'm not very familiar with PoolCounter but I suspect it's a fairly generic system for handling this sort of thing. However, stampedes have never been a practical problem for anything except massive traffic combined with slow recaching, and that's a fairly rare case. So I don't think we want to add that sort of concurrency protection everywhere.
Roan
On Mon, Nov 28, 2011 at 12:06 PM, Roan Kattouw roan.kattouw@gmail.com wrote:
On Mon, Nov 28, 2011 at 8:59 PM, Neil Harris neil@tonal.clara.co.uk wrote:
I hadn't thought properly about cache stampedes: since the parser cache is only part of page rendering, this might also explain some of the other occasional slowdowns I've seen on Wikipedia.
It would be really cool if there could be some sort of general mechanism to enable this to be prevented for all page URLs protected by memcaching, throughout the system.
I'm not very familiar with PoolCounter but I suspect it's a fairly generic system for handling this sort of thing. However, stampedes have never been a practical problem for anything except massive traffic combined with slow recaching, and that's a fairly rare case. So I don't think we want to add that sort of concurrency protection everywhere.
For memcache objects that can be grouped together into an "ok to use if a bit stale" bucket (such as all kinds of stats), there is also the possibility of lazy async regeneration.
Data is stored in memcache with a fuzzy expire time, i.e. { data: foo, stale: $now+15min } and a cache TTL of forever. When getting the key, if the timestamp inside marks the data as stale, you can attempt to obtain an exclusive (acq4me) lock from poolcounter. If immediately successful, launch an async job to regenerate the cache (while holding the lock) but continue the request with stale data. In all other cases, just use the stale data. Mainly useful if the regeneration work is hideously expensive, such that you wouldn't want clients blocking on even a single cache regen (as is the behavior with poolcounter as deployed for the parser cache).
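A rough Python sketch of the scheme Asher describes, with an in-process dict standing in for memcached and a plain non-blocking lock standing in for poolcounter's acq4me (function and key names are hypothetical):

```python
import threading
import time

store = {}                        # stands in for memcached (cache TTL "forever")
regen_lock = threading.Lock()     # stands in for a poolcounter acq4me lock
calls = 0

def count_regen():
    """Hypothetical expensive stats computation; counts its invocations."""
    global calls
    calls += 1
    return f"stats v{calls}"

def get_stats(key, regenerate, fresh_for=900):
    entry = store.get(key)
    if entry is None:
        # Nothing cached yet, so there is no stale copy to fall back on:
        # the very first request has to compute inline.
        entry = {"data": regenerate(), "stale": time.monotonic() + fresh_for}
        store[key] = entry
        return entry["data"]
    if time.monotonic() >= entry["stale"] and regen_lock.acquire(blocking=False):
        # We hold the exclusive lock: refresh in the background,
        # but keep serving the stale value to this request too.
        def refresh():
            try:
                store[key] = {"data": regenerate(),
                              "stale": time.monotonic() + fresh_for}
            finally:
                regen_lock.release()
        threading.Thread(target=refresh).start()
    # Everyone else (and the lock winner) continues with the cached data.
    return entry["data"]

first = get_stats("fundraiser", count_regen)    # first miss: computed inline
second = get_stats("fundraiser", count_regen)   # still fresh: served from cache
```

The key property is that once the data goes stale, at most one background refresh runs while every concurrent request keeps getting the stale copy immediately, so no client ever blocks on the regeneration.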
On 28/11/11 21:26, Asher Feldman wrote:
For memcache objects that can be grouped together into an "ok to use if a bit stale" bucket (such as all kinds of stats), there is also the possibility of lazy async regeneration.
Data is stored in memcache with a fuzzy expire time, i.e. { data: foo, stale: $now+15min } and a cache TTL of forever. When getting the key, if the timestamp inside marks the data as stale, you can attempt to obtain an exclusive (acq4me) lock from poolcounter. If immediately successful, launch an async job to regenerate the cache (while holding the lock) but continue the request with stale data. In all other cases, just use the stale data. Mainly useful if the regeneration work is hideously expensive, such that you wouldn't want clients blocking on even a single cache regen (as is the behavior with poolcounter as deployed for the parser cache).
I see you looked at the poolcounter code :) Yes, it would be useful to have a class handling that kind of thing, which would store data valid for a known time (usually a guess) with a slightly longer expiry, packed with a timestamp. And if it was overdue, launch an update protected with an acq4any with a 0 queue. I hadn't considered showing "stale" data for that first hit, but it could easily be done through DeferredUpdates. You only need to be careful not to be reentrant on the same key, since you might deadlock (although with a 0 queue that's unlikely).
On Tue, Nov 29, 2011 at 12:57 AM, Roan Kattouw roan.kattouw@gmail.com wrote:
On Mon, Nov 28, 2011 at 3:50 PM, MZMcBride z@mzmcbride.com wrote:
I asked roughly the same thing yesterday (more along the lines of "shouldn't it take someone ten minutes to add memcache support to the extension?"). Reedy said it was long-running queries that never timed out that apparently caused the issue.
I read the code and sent some unsolicited advice to the fundraising team. Essentially, the recaching operation could be doing ~35 times fewer queries; that should help.
Roan
Wasn't the extension reviewed before being enabled? Shouldn't the review have caught this?
On Mon, Nov 28, 2011 at 9:32 PM, K. Peachey p858snake@gmail.com wrote:
Wasn't the extension reviewed before being enabled? Shouldn't the review have caught this?
The relevant code wasn't present in the extension when it was originally enabled (in 2009 I think), it was introduced this year. I don't know who reviewed those changes, all I know is it wasn't me.
Roan