Just a quick note -- we're experiencing some fun load spikes due to heavy net usage of people searching for or talking about Michael Jackson's reported death or near-death.
You may see some intermittent database connection failures on en.wikipedia.org for a little while as connections back up; we're poking to see if we can reduce this.
Updates at tech blog: http://techblog.wikimedia.org/2009/06/current-events/
-- brion
Hi!
Just a quick note -- we're experiencing some fun load spikes due to heavy net usage of people searching for or talking about Michael Jackson's reported death or near-death.
The tech blog doesn't seem to get many updates.
The problem is quite simple: lots of people (about a million pageviews on the article in an hour) caused a cache stampede, where every pageview between invalidation and re-rendering needed parsing. And as the MJ article is quite cite-heavy (and cite problems were outlined in http://article.gmane.org/gmane.science.linguistics.wikipedia.technical/41547 ;) the reparsing was very, very painful on our application cluster - all Apache children eventually ended up doing lots of parsing work and consuming connection slots to pretty much everything :)
Cheers, Domas
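To make the stampede mechanics concrete, here is a minimal PHP sketch (not MediaWiki's actual code; the cache array and parseArticle() are simplified stand-ins for the real ParserCache and Parser). Once the cached copy is invalidated, every request that arrives before the first re-render finishes sees a miss and starts its own parse:

<?php
// Simplified stand-ins for the parser cache and the parser itself.
$cache = array();             // title => array( 'html' => ..., 'cacheTime' => ... )
$pageTouched = time();        // the article was just edited, invalidating the cache

function parseArticle( $title ) {
    sleep( 1 );               // placeholder for a parse that really took 20-30 seconds
    return "<rendered HTML for $title>";
}

function view( $title, array &$cache, $pageTouched ) {
    $entry = isset( $cache[$title] ) ? $cache[$title] : null;
    if ( $entry && $entry['cacheTime'] >= $pageTouched ) {
        return $entry['html'];    // cache hit: cheap
    }
    // Cache miss or stale entry: every concurrent request falls through here
    // and re-parses independently -- that is the stampede.
    $html = parseArticle( $title );
    $cache[$title] = array( 'html' => $html, 'cacheTime' => time() );
    return $html;
}

echo view( 'Michael Jackson', $cache, $pageTouched ), "\n";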
On Thu, Jun 25, 2009 at 8:14 PM, Domas Mituzas <midom.lists@gmail.com> wrote:
The problem is quite simple: lots of people (about a million pageviews on the article in an hour) caused a cache stampede, where every pageview between invalidation and re-rendering needed parsing. And as the MJ article is quite cite-heavy (and cite problems were outlined in http://article.gmane.org/gmane.science.linguistics.wikipedia.technical/41547 ;) the reparsing was very, very painful on our application cluster - all Apache children eventually ended up doing lots of parsing work and consuming connection slots to pretty much everything :)
So if multiple page views are requesting the same uncached page at the same time with the same settings, the later ones should all block on the first one's reparsing instead of doing it themselves. It should provide faster service for big articles too, even ignoring load, since the earlier parse will be done before you could finish yours anyway.
That seems pretty easy to do. You'd have some delays if everything waited on a process that died or something, of course.
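A rough PHP sketch of that idea (hypothetical, not existing MediaWiki code), using an advisory file lock as a stand-in for whatever lock manager would actually be used, and a $cache array standing in for a shared store such as memcached: the first request takes the lock and parses, later requests block on the lock and then re-check the cache.

<?php
function expensiveParse( $title ) {
    sleep( 1 );                       // placeholder for a 20-30 second parse
    return "<rendered HTML for $title>";
}

function viewBlocking( $title, array &$cache, $pageTouched ) {
    $entry = isset( $cache[$title] ) ? $cache[$title] : null;
    if ( $entry && $entry['cacheTime'] >= $pageTouched ) {
        return $entry['html'];        // fresh cache hit
    }

    // Advisory per-article lock; later requests block here until the
    // first one finishes parsing.
    $lock = fopen( sys_get_temp_dir() . '/parse-' . md5( $title ) . '.lock', 'c' );
    flock( $lock, LOCK_EX );

    // Re-check after acquiring the lock: another process may have parsed
    // and cached the page while we were waiting.
    $entry = isset( $cache[$title] ) ? $cache[$title] : null;
    if ( !$entry || $entry['cacheTime'] < $pageTouched ) {
        $cache[$title] = array( 'html' => expensiveParse( $title ), 'cacheTime' => time() );
    }

    flock( $lock, LOCK_UN );
    fclose( $lock );
    return $cache[$title]['html'];
}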
Aryeh Gregor wrote:
On Thu, Jun 25, 2009 at 8:14 PM, Domas Mituzas <midom.lists@gmail.com> wrote:
The problem is quite simple: lots of people (about a million pageviews on the article in an hour) caused a cache stampede, where every pageview between invalidation and re-rendering needed parsing. And as the MJ article is quite cite-heavy (and cite problems were outlined in http://article.gmane.org/gmane.science.linguistics.wikipedia.technical/41547 ;) the reparsing was very, very painful on our application cluster - all Apache children eventually ended up doing lots of parsing work and consuming connection slots to pretty much everything :)
So if multiple page views are requesting the same uncached page at the same time with the same settings, the later ones should all block on the first one's reparsing instead of doing it themselves. It should provide faster service for big articles too, even ignoring load, since the earlier parse will be done before you could finish yours anyway.
It's quite a complex feature. If you have a server that deadlocks or is otherwise extremely slow, then it will block rendering for all other attempts, meaning that the article can not be viewed at all. That scenario could even lead to site-wide downtime, since threads waiting for the locks could consume all available apache threads, or all available DB connections.
It's a reasonable idea, but implementing it would require a careful design, and possibly some other concepts like per-article thread count limits.
-- Tim Starling
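For illustration only, here is a sketch of the two safeguards Tim mentions: a timeout on how long a viewer will wait for someone else's parse, and a per-article cap on blocked threads. The counter array stands in for memcached or a lock manager, and parseIsFinished() is a hypothetical check against the parser cache; the thresholds are made up.

<?php
define( 'MAX_WAIT_SECONDS', 15 );     // give up waiting after this long
define( 'MAX_WAITERS', 50 );          // per-article cap on blocked threads

function parseIsFinished( $title ) {
    // Hypothetical stand-in: would check whether a fresh parser cache
    // entry for $title has appeared.
    return false;
}

function waitForOtherParse( $title, array &$waiters ) {
    $waiters[$title] = ( isset( $waiters[$title] ) ? $waiters[$title] : 0 ) + 1;
    if ( $waiters[$title] > MAX_WAITERS ) {
        $waiters[$title]--;
        return false;                 // too many blocked threads already: don't pile up
    }

    $deadline = time() + MAX_WAIT_SECONDS;
    while ( time() < $deadline ) {
        if ( parseIsFinished( $title ) ) {
            $waiters[$title]--;
            return true;              // someone else finished; use their result
        }
        usleep( 200000 );             // poll every 200 ms
    }

    $waiters[$title]--;
    return false;                     // lock holder looks stuck: caller should fall back
}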
Tim Starling wrote:
It's quite a complex feature. If you have a server that deadlocks or is otherwise extremely slow, then it will block rendering for all other attempts, meaning that the article can not be viewed at all. That scenario could even lead to site-wide downtime, since threads waiting for the locks could consume all available apache threads, or all available DB connections.
It's a reasonable idea, but implementing it would require a careful design, and possibly some other concepts like per-article thread count limits.
*nod* We should definitely ponder the issue, since it comes up intermittently but regularly with big news events like this. At the least, if we can have some automatic threshold that temporarily disables or reduces hits on stampeded pages, that'd be spiffy...
-- brion
2009/6/26 Brion Vibber <brion@wikimedia.org>:
Tim Starling wrote:
It's quite a complex feature. If you have a server that deadlocks or is otherwise extremely slow, then it will block rendering for all other attempts, meaning that the article can not be viewed at all. That scenario could even lead to site-wide downtime, since threads waiting for the locks could consume all available apache threads, or all available DB connections.
It's a reasonable idea, but implementing it would require a careful design, and possibly some other concepts like per-article thread count limits.
*nod* We should definitely ponder the issue, since it comes up intermittently but regularly with big news events like this. At the least, if we can have some automatic threshold that temporarily disables or reduces hits on stampeded pages, that'd be spiffy...
Of course, the fact that everyone's first port of call after hearing such news is to check the Wikipedia page is a fantastic thing, so it would be really unfortunate if we have to stop people doing that. Would it be possible, perhaps, to direct all requests for a certain page through one server, so that the other servers can continue to serve the rest of the site unaffected? Or perhaps excessively popular pages could be rendered (for anons) as part of the editing process, rather than the viewing process, since that would mean each version of the article is rendered only once (for anons) and would just slow down editing slightly (presumably by a fraction of a second), which we can live with. There must be something we can do that allows people to continue viewing the page wherever possible.
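The "rendered as part of the editing process" part could look roughly like this PHP sketch (illustrative only; storeRevision() and renderWikitext() are hypothetical stand-ins for the real save and parse steps, and the parser cache is a plain array):

<?php
function storeRevision( $title, $wikitext ) {
    // Hypothetical stand-in for writing the new revision to the database.
}

function renderWikitext( $wikitext ) {
    sleep( 1 );                        // placeholder for the expensive parse
    return '<rendered HTML>';
}

function saveEdit( $title, $wikitext, array &$parserCache ) {
    storeRevision( $title, $wikitext );
    // Pay the parse cost once, on the editor's request...
    $parserCache[$title] = array( 'html' => renderWikitext( $wikitext ), 'cacheTime' => time() );
}

function viewAsAnon( $title, array $parserCache ) {
    // ...so anonymous readers only ever read from the cache and never
    // trigger a parse themselves.
    return isset( $parserCache[$title] ) ? $parserCache[$title]['html'] : null;
}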
On Fri, Jun 26, 2009 at 6:33 AM, Thomas Dalton <thomas.dalton@gmail.com> wrote:
Of course, the fact that everyone's first port of call after hearing such news is to check the Wikipedia page is a fantastic thing, so it would be really unfortunate if we have to stop people doing that.
He didn't say we'd shut down views for the article, just that we'd shut down reparsing or cache invalidation or something. This is the live hack that was applied yesterday:
Index: includes/parser/ParserCache.php
===================================================================
--- includes/parser/ParserCache.php	(revision 52359)
+++ includes/parser/ParserCache.php	(working copy)
@@ -63,6 +63,7 @@
 		if ( is_object( $value ) ) {
 			wfDebug( "Found.\n" );
 			# Delete if article has changed since the cache was made
+			if( $article->mTitle->getPrefixedText() != 'Michael Jackson' ) { // temp hack!
 			$canCache = $article->checkTouched();
 			$cacheTime = $value->getCacheTime();
 			$touched = $article->mTouched;
@@ -82,6 +83,7 @@
 			}
 			wfIncrStats( "pcache_hit" );
 		}
+		} // temp hack!
 	} else {
 		wfDebug( "Parser cache miss.\n" );
 		wfIncrStats( "pcache_miss_absent" );
It just meant that people were seeing outdated versions of the article.
Would it be possible, perhaps, to direct all requests for a certain page through one server, so that the other servers can continue to serve the rest of the site unaffected?
Every page view involves a number of servers, and they're not all interchangeable, so this doesn't make a lot of sense.
Or perhaps excessively popular pages could be rendered (for anons) as part of the editing process, rather than the viewing process, since that would mean each version of the article is rendered only once (for anons) and would just slow down editing slightly (presumably by a fraction of a second), which we can live with.
You think that parsing a large page takes a fraction of a second? Try twenty or thirty seconds.
But this sounds like a good idea. If a process is already parsing the page, why don't we just have other processes display an old cached version of the page instead of waiting or trying to reparse themselves? The worst that would happen is some users would get old views for a couple of minutes.
2009/6/26 Aryeh Gregor <Simetrical+wikilist@gmail.com>:
But this sounds like a good idea. If a process is already parsing the page, why don't we just have other processes display an old cached version of the page instead of waiting or trying to reparse themselves? The worst that would happen is some users would get old views for a couple of minutes.
This is a very good idea, and sounds much better than having those other processes wait for the first process to finish parsing. It would also reduce the severity of the deadlocks occurring when a process gets stuck on a parse or dies in the middle of it: instead of deadlocking, the other processes would just display stale versions instead of wasting time. If we design these parser cache locks to expire after a few minutes or so, it should work just fine.
Roan Kattouw (Catrope)
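A sketch of that combination, serving the stale copy while one process re-parses behind an expiring lock; hypothetical code, with the lock and cache arrays standing in for memcached and the TTL made up:

<?php
define( 'PARSE_LOCK_TTL', 180 );       // seconds; "expire after a few minutes"

function slowParse( $title ) {
    sleep( 1 );                        // placeholder for a 20-30 second parse
    return "<rendered HTML for $title>";
}

function viewServeStale( $title, array &$cache, array &$locks, $pageTouched ) {
    $entry = isset( $cache[$title] ) ? $cache[$title] : null;
    if ( $entry && $entry['cacheTime'] >= $pageTouched ) {
        return $entry['html'];                         // fresh hit
    }

    $lockFresh = isset( $locks[$title] ) && $locks[$title] > time() - PARSE_LOCK_TTL;
    if ( $lockFresh && $entry ) {
        return $entry['html'];                         // someone else is parsing: serve the stale copy
    }

    // Either nobody is parsing yet, or the previous lock holder died and
    // its lock has expired: take the lock and re-parse ourselves.
    $locks[$title] = time();
    $cache[$title] = array( 'html' => slowParse( $title ), 'cacheTime' => time() );
    unset( $locks[$title] );
    return $cache[$title]['html'];
}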
This is a very good idea, and sounds much better than having those other processes wait for the first process to finish parsing.
The major problem with all dirty caching is that we have more than one caching layer, and of course, things abort.
Showing people dirty versions instead of the proper article leads to a situation where, in the case of vandal fighting etc., people will see stale versions instead of waiting a few seconds and getting the real one.
In theory, the update flow could look like this:
1. Set "I'm working on this" in a parallelism coordinator or lock manager
2. Do all database transactions & commit
3. Parse
4. Set memcached object
5. Invalidate squid objects
Now, whether we parse, block, or serve stale could be dynamic: e.g. if we detect more than x parallel parses we fall back to blocking for a few seconds, and once we detect more than y blocked threads on the task, or the block expires and there's no fresh content yet (or there's a new copy..), then stale content can be served. In a perfect world that asks for specialized software :)
Do note, for the past quite a few years we have done lots and lots of work to avoid serving stale content. I would not see dirty serving as something we should be proud of ;-)
Domas
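One way to read that dynamic policy as code, a toy PHP sketch of the parse / block / serve-stale decision with made-up thresholds for Domas's x and y (the counters would live in memcached or a lock manager, not in local arrays):

<?php
define( 'MAX_PARALLEL_PARSES', 3 );    // "x" parallel parses before we start blocking
define( 'MAX_BLOCKED_THREADS', 20 );   // "y" blocked threads before we serve stale

function choosePolicy( $title, array $parsing, array $blocked, $lockExpired, $haveStaleCopy ) {
    $parses  = isset( $parsing[$title] ) ? $parsing[$title] : 0;
    $waiters = isset( $blocked[$title] ) ? $blocked[$title] : 0;

    if ( $parses < MAX_PARALLEL_PARSES ) {
        return 'parse';                // normal case: just render it
    }
    if ( $haveStaleCopy && ( $waiters >= MAX_BLOCKED_THREADS || $lockExpired ) ) {
        return 'serve-stale';          // protect the cluster; show the old copy
    }
    return 'block';                    // wait a few seconds for fresh content
}

// Heavy stampede: 5 parallel parses, 40 blocked threads, stale copy available.
echo choosePolicy( 'Michael Jackson', array( 'Michael Jackson' => 5 ),
    array( 'Michael Jackson' => 40 ), false, true ), "\n";    // prints "serve-stale"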