https://en.wikipedia.org/w/index.php?title=Sharon_Aguilar&action=history
Request: GET http://en.wikipedia.org/w/index.php?title=Sharon_Aguilar&action=history, from 208.80.152.70 via sq66.wikimedia.org (squid/2.7.STABLE9) to () Error: ERR_CANNOT_FORWARD, errno (11) Resource temporarily unavailable at Sun, 27 Nov 2011 22:04:23 GMT
Request: GET http://en.wikipedia.org/w/index.php?title=Sharon_Aguilar&action=history, from 208.80.152.85 via sq66.wikimedia.org (squid/2.7.STABLE9) to () Error: ERR_CANNOT_FORWARD, errno (11) Resource temporarily unavailable at Sun, 27 Nov 2011 22:08:30 GMT
Request: GET http://en.wikipedia.org/w/index.php?title=Sharon_Aguilar&action=history, from 208.80.152.75 via sq66.wikimedia.org (squid/2.7.STABLE9) to () Error: ERR_CANNOT_FORWARD, errno (11) Resource temporarily unavailable at Sun, 27 Nov 2011 22:09:55 GMT
Request: GET http://en.wikipedia.org/w/index.php?title=Sharon_Aguilar&action=history, from 208.80.152.50 via sq66.wikimedia.org (squid/2.7.STABLE9) to () Error: ERR_CANNOT_FORWARD, errno (11) Resource temporarily unavailable at Sun, 27 Nov 2011 22:14:53 GMT
I am also getting this from en.m.wikipedia.org:
Error 503 Service Unavailable
Service Unavailable
Guru Meditation:
XID: 1592365530
Varnish cache server
On android, verizon 4g, default browser, 10+ minutes.
On Nov 27, 2011 5:18 PM, "William Allen Simpson" <william.allen.simpson@gmail.com> wrote:
https://en.wikipedia.org/w/index.php?title=Sharon_Aguilar&action=history
Request: GET http://en.wikipedia.org/w/index.php?title=Sharon_Aguilar&action=history, from 208.80.152.70 via sq66.wikimedia.org (squid/2.7.STABLE9) to () Error: ERR_CANNOT_FORWARD, errno (11) Resource temporarily unavailable at Sun, 27 Nov 2011 22:04:23 GMT
Request: GET http://en.wikipedia.org/w/index.php?title=Sharon_Aguilar&action=history, from 208.80.152.85 via sq66.wikimedia.org (squid/2.7.STABLE9) to () Error: ERR_CANNOT_FORWARD, errno (11) Resource temporarily unavailable at Sun, 27 Nov 2011 22:08:30 GMT
Request: GET http://en.wikipedia.org/w/index.php?title=Sharon_Aguilar&action=history, from 208.80.152.75 via sq66.wikimedia.org (squid/2.7.STABLE9) to () Error: ERR_CANNOT_FORWARD, errno (11) Resource temporarily unavailable at Sun, 27 Nov 2011 22:09:55 GMT
Request: GET http://en.wikipedia.org/w/index.php?title=Sharon_Aguilar&action=history, from 208.80.152.50 via sq66.wikimedia.org (squid/2.7.STABLE9) to () Error: ERR_CANNOT_FORWARD, errno (11) Resource temporarily unavailable at Sun, 27 Nov 2011 22:14:53 GMT
Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
We had a site outage of about 30 mins, caused by a major issue, potentially hardware-related, with a database server, which blocked all MediaWiki application servers (and thereby rendered most of our sites unusable). Should be fixed now; we'll prepare a more comprehensive incident analysis soon.
Thanks to the ops team for their speedy response.
All best, Erik
It appears that we were actually taken down by the reddit community, after a link to the fundraising stats page was posted under Brandon's IAMA there.
sq71.wikimedia.org 943326197 2011-11-27T22:51:09.075 62032 109.125.42.71 TCP_MISS/200 1035 GET http://wikimediafoundation.org/wiki/Special:FundraiserStatistics ANY_PARENT/ 208.80.152.47 text/html * http://www.reddit.com/r/IAmA/comments/mr4pf/i_am_wikipedia_programmer_brando... * - Mozilla/5.0%20(Windows%20NT%206.1;%20WOW64)%20AppleWebKit/535.2%20(KHTML,%20like%20Gecko)%20Chrome/15.0.874.121%20Safari/535.2
That page wasn't suitable for high volume public consumption (very expensive db query + not properly cached), so the site problem persisted even after the db initially suspected as bad was rotated out.
On Sun, Nov 27, 2011 at 2:39 PM, Erik Moeller erik@wikimedia.org wrote:
We had a site outage of about 30 mins, caused by a major issue, potentially hardware-related, with a database server, which blocked all MediaWiki application servers (and thereby rendered most of our sites unusable). Should be fixed now; we'll prepare a more comprehensive incident analysis soon.
Thanks to the ops team for their speedy response.
All best, Erik -- Erik Möller VP of Engineering and Product Development, Wikimedia Foundation
Support Free Knowledge: http://wikimediafoundation.org/wiki/Donate
On Mon, Nov 28, 2011 at 12:21 AM, Asher Feldman afeldman@wikimedia.org wrote:
That page wasn't suitable for high volume public consumption (very expensive db query + not properly cached), so the site problem persisted even after the db initially suspected as bad was rotated out.
What happened to it? When this page was introduced, it did have proper caching in memcached. Was that removed? Or did we get a cache stampede?
Roan
Roan Kattouw wrote:
On Mon, Nov 28, 2011 at 12:21 AM, Asher Feldman afeldman@wikimedia.org wrote:
That page wasn't suitable for high volume public consumption (very expensive db query + not properly cached), so the site problem persisted even after the db initially suspected as bad was rotated out.
What happened to it? When this page was introduced, it did have proper caching in memcached. Was that removed? Or did we get a cache stampede?
I asked roughly the same thing yesterday (more along the lines of "shouldn't it take someone ten minutes to add memcache support to the extension?"). Reedy said it was long-running queries that never timed out that apparently caused the issue.
The ContributionReporting extension being disabled is being tracked here: https://bugzilla.wikimedia.org/show_bug.cgi?id=32679.
MZMcBride
On Mon, Nov 28, 2011 at 3:50 PM, MZMcBride z@mzmcbride.com wrote:
I asked roughly the same thing yesterday (more along the lines of "shouldn't it take someone ten minutes to add memcache support to the extension?"). Reedy said it was long-running queries that never timed out that apparently caused the issue.
I read the code and sent some unsolicited advice to the fundraising team. Essentially, the recaching operation could be doing ~35 times fewer queries; that should help.
Roan
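Roan's point about reducing query count can be sketched as follows (the table name, columns, and per-day granularity are assumptions for illustration; the real ContributionReporting queries aren't shown in this thread): collapsing N per-day queries into a single GROUP BY query is typically where an ~N-fold reduction comes from.

```python
import sqlite3

# Hypothetical schema standing in for the fundraising contributions table;
# the actual ContributionReporting queries are not shown in the thread.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE contributions (received TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO contributions VALUES (?, ?)",
    [("2011-11-25", 10.0), ("2011-11-25", 5.0), ("2011-11-26", 20.0)],
)

# Naive recache: one round trip per fundraiser day (N queries total).
days = ["2011-11-25", "2011-11-26"]
per_day = {
    d: conn.execute(
        "SELECT COALESCE(SUM(amount), 0) FROM contributions WHERE received = ?",
        (d,),
    ).fetchone()[0]
    for d in days
}

# Batched recache: a single GROUP BY query returns every day's total at once.
batched = dict(
    conn.execute(
        "SELECT received, SUM(amount) FROM contributions GROUP BY received"
    ).fetchall()
)
```

Both approaches produce identical totals; the batched version just does it in one query instead of one per day.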
On 28/11/11 14:57, Roan Kattouw wrote:
On Mon, Nov 28, 2011 at 3:50 PM, MZMcBride z@mzmcbride.com wrote:
I asked roughly the same thing yesterday (more along the lines of "shouldn't it take someone ten minutes to add memcache support to the extension?"). Reedy said it was long-running queries that never timed out that apparently caused the issue.
I read the code and sent some unsolicited advice to the fundraising team. Essentially, the recaching operation could be doing ~35 times fewer queries; that should help.
Roan
And adding memcached caching with even, say, as little as a 1-minute cache entry timeout, should dilute that reduced load even more, and put an upper bound on the load generated, just in case it gets slashdotted/redditted again.
-- Neil
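Neil's point can be sketched in a few lines of Python (an in-process dict stands in for memcached; all names here are made up for illustration): with even a short TTL, the backend sees at most one rebuild per TTL window per key, no matter how large the traffic burst.

```python
import time

class ShortTTLCache:
    """Bound backend load: at most one rebuild per `ttl` seconds per key."""

    def __init__(self, ttl=60):
        self.ttl = ttl
        self.store = {}          # key -> (expires_at, value); stand-in for memcached
        self.backend_calls = 0   # instrumentation for this sketch

    def get(self, key, compute):
        now = time.monotonic()
        hit = self.store.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]        # cache hit: backend untouched
        self.backend_calls += 1
        value = compute()        # the expensive db query
        self.store[key] = (now + self.ttl, value)
        return value

cache = ShortTTLCache(ttl=60)
for _ in range(1000):            # a burst of reddit traffic within one minute
    result = cache.get("fundraiser-stats", lambda: "expensive db result")
```

Only the first request in the window reaches the backend; the other 999 are served from cache, which is exactly the upper bound on load Neil describes.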
On Mon, Nov 28, 2011 at 8:28 PM, Neil Harris neil@tonal.clara.co.uk wrote:
And adding memcached caching with even, say, as little as a 1 minute cache entry timeout, should dilute that reduced load even more, and put an upperbound on the load generated, just in case it gets slashdot/reddited again.
It was already in memcached, cached for 15 minutes. However, if recaching takes a long time and your page gets a lot of traffic, you can get a cache stampede (just like when Michael Jackson died): while the recache is in progress, there are more hits for your page and a zillion Apache workers all race to rebuild the cache, unaware of each other. I have no evidence that that's what happened, but that's my theory. Making the recache faster and/or upping the cache timeout reduces the size and the frequency, respectively, of the window in which this can happen.
The cache stampede problem was solved for the particular case of the parser cache using PoolCounter, but I don't think it's necessary for other types of caching. Computing fundraiser statistics simply shouldn't be that slow.
Roan
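The single-flight idea behind PoolCounter can be sketched roughly like this (a toy in-process version using a plain lock; the real PoolCounter is a separate network daemon, and the names here are hypothetical): concurrent cache misses collapse into one rebuild instead of a zillion racing workers.

```python
import threading

cache = {}
rebuild_lock = threading.Lock()
rebuilds = 0

def get_with_single_flight(key, compute):
    """While one worker recomputes, the rest wait and reuse its result."""
    global rebuilds
    if key in cache:                 # fast path: no locking on a warm cache
        return cache[key]
    with rebuild_lock:
        if key not in cache:         # re-check: another worker may have won
            rebuilds += 1
            cache[key] = compute()   # the slow recache happens exactly once
    return cache[key]

workers = [
    threading.Thread(target=get_with_single_flight,
                     args=("stats-page", lambda: "rendered stats"))
    for _ in range(50)
]
for w in workers:
    w.start()
for w in workers:
    w.join()
```

All 50 concurrent misses result in a single rebuild; without the lock and re-check, each worker that saw the empty cache would have run the expensive computation itself.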
On 28/11/11 19:36, Roan Kattouw wrote:
On Mon, Nov 28, 2011 at 8:28 PM, Neil Harris neil@tonal.clara.co.uk wrote:
And adding memcached caching with even, say, as little as a 1-minute cache entry timeout, should dilute that reduced load even more, and put an upper bound on the load generated, just in case it gets slashdotted/redditted again.
It was already in memcached, cached for 15 minutes. However, if recaching takes a long time and your page gets a lot of traffic, you can get a cache stampede (just like when Michael Jackson died): while the recache is in progress, there are more hits for your page and a zillion Apache workers all race to rebuild the cache, unaware of each other. I have no evidence that that's what happened, but that's my theory. Making the recache faster and/or upping the cache timeout reduces the size and the frequency, respectively, of the window in which this can happen.
The cache stampede problem was solved for the particular case of the parser cache using PoolCounter, but I don't think it's necessary for other types of caching. Computing fundraiser statistics simply shouldn't be that slow.
Roan
I hadn't thought properly about cache stampedes: since the parser cache is only part of page rendering, this might also explain some of the other occasional slowdowns I've seen on Wikipedia.
It would be really cool if there could be some sort of general mechanism to enable this to be prevented for all page URLs protected by memcaching, throughout the system.
-- N.
On Mon, Nov 28, 2011 at 8:59 PM, Neil Harris neil@tonal.clara.co.uk wrote:
I hadn't thought properly about cache stampedes: since the parser cache is only part of page rendering, this might also explain some of the other occasional slowdowns I've seen on Wikipedia.
It would be really cool if there could be some sort of general mechanism to enable this to be prevented for all page URLs protected by memcaching, throughout the system.
I'm not very familiar with PoolCounter but I suspect it's a fairly generic system for handling this sort of thing. However, stampedes have never been a practical problem for anything except massive traffic combined with slow recaching, and that's a fairly rare case. So I don't think we want to add that sort of concurrency protection everywhere.
Roan
On Mon, Nov 28, 2011 at 12:06 PM, Roan Kattouw roan.kattouw@gmail.com wrote:
On Mon, Nov 28, 2011 at 8:59 PM, Neil Harris neil@tonal.clara.co.uk wrote:
I hadn't thought properly about cache stampedes: since the parser cache is only part of page rendering, this might also explain some of the other occasional slowdowns I've seen on Wikipedia.
It would be really cool if there could be some sort of general mechanism to enable this to be prevented for all page URLs protected by memcaching, throughout the system.
I'm not very familiar with PoolCounter but I suspect it's a fairly generic system for handling this sort of thing. However, stampedes have never been a practical problem for anything except massive traffic combined with slow recaching, and that's a fairly rare case. So I don't think we want to add that sort of concurrency protection everywhere.
For memcache objects that can be grouped together into an "ok to use if a bit stale" bucket (such as all kinds of stats), there is also the possibility of lazy async regeneration.
Data is stored in memcache with a fuzzy expire time, i.e. { data: foo, stale: $now+15min } and a cache TTL of forever. When getting the key, if the timestamp inside marks the data as stale, you can attempt to obtain an exclusive (acq4me) lock from poolcounter. If immediately successful, launch an async job to regenerate the cache (while holding the lock) but continue the request with stale data. In all other cases, just use the stale data. Mainly useful if the regeneration work is hideously expensive, such that you wouldn't want clients blocking on even a single cache regen (as is the behavior with poolcounter as deployed for the parser cache).
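A rough Python sketch of the scheme Asher describes, with an in-process dict standing in for memcached and a plain non-blocking lock standing in for poolcounter's acq4me (function and key names are hypothetical):

```python
import threading
import time

store = {}                        # stands in for memcached (cache TTL "forever")
regen_lock = threading.Lock()     # stands in for a poolcounter acq4me lock
calls = 0

def count_regen():
    """Hypothetical expensive stats computation; counts its invocations."""
    global calls
    calls += 1
    return f"stats v{calls}"

def get_stats(key, regenerate, fresh_for=900):
    entry = store.get(key)
    if entry is None:
        # Nothing cached yet, so there is no stale copy to fall back on:
        # the very first request has to compute inline.
        entry = {"data": regenerate(), "stale": time.monotonic() + fresh_for}
        store[key] = entry
        return entry["data"]
    if time.monotonic() >= entry["stale"] and regen_lock.acquire(blocking=False):
        # We hold the exclusive lock: refresh in the background,
        # but keep serving the stale value to this request too.
        def refresh():
            try:
                store[key] = {"data": regenerate(),
                              "stale": time.monotonic() + fresh_for}
            finally:
                regen_lock.release()
        threading.Thread(target=refresh).start()
    # Everyone else (and the lock winner) continues with the cached data.
    return entry["data"]

first = get_stats("fundraiser", count_regen)    # first miss: computed inline
second = get_stats("fundraiser", count_regen)   # still fresh: served from cache
```

The key property is that once the data goes stale, at most one background refresh runs while every concurrent request keeps getting the stale copy immediately, so no client ever blocks on the regeneration.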
On 28/11/11 21:26, Asher Feldman wrote:
For memcache objects that can be grouped together into an "ok to use if a bit stale" bucket (such as all kinds of stats), there is also the possibility of lazy async regeneration.
Data is stored in memcache with a fuzzy expire time, i.e. { data: foo, stale: $now+15min } and a cache TTL of forever. When getting the key, if the timestamp inside marks the data as stale, you can attempt to obtain an exclusive (acq4me) lock from poolcounter. If immediately successful, launch an async job to regenerate the cache (while holding the lock) but continue the request with stale data. In all other cases, just use the stale data. Mainly useful if the regeneration work is hideously expensive, such that you wouldn't want clients blocking on even a single cache regen (as is the behavior with poolcounter as deployed for the parser cache).
I see you looked at the poolcounter code :) Yes, it would be useful to have a class handling that kind of thing, which would store data valid for a known time (usually a guess) with a slightly longer expiry, packed with a timestamp. And if it was overdue, launch an update protected with an acq4any with a 0 queue. I hadn't considered showing "stale" data for that first hit, but it could easily be done through DeferredUpdates. You only need to be careful not to be reentrant on the same key, since you might deadlock (although with a 0 queue that's unlikely).
On Tue, Nov 29, 2011 at 12:57 AM, Roan Kattouw roan.kattouw@gmail.com wrote:
On Mon, Nov 28, 2011 at 3:50 PM, MZMcBride z@mzmcbride.com wrote:
I asked roughly the same thing yesterday (more along the lines of "shouldn't it take someone ten minutes to add memcache support to the extension?"). Reedy said it was long-running queries that never timed out that apparently caused the issue.
I read the code and sent some unsolicited advice to the fundraising team. Essentially, the recaching operation could be doing ~35 times fewer queries; that should help.
Roan
Wasn't the extension reviewed before being enabled? Shouldn't the review have caught this?
On Mon, Nov 28, 2011 at 9:32 PM, K. Peachey p858snake@gmail.com wrote:
Wasn't the extension reviewed before being enabled? Shouldn't the review have caught this?
The relevant code wasn't present in the extension when it was originally enabled (in 2009 I think), it was introduced this year. I don't know who reviewed those changes, all I know is it wasn't me.
Roan