On 11-03-13 06:05 PM, Tim Starling wrote:
> On 14/03/11 11:48, William Allen Simpson wrote:
>> Secure basically fell over for a while, generating nothing but proxy errors. I'm not sure that's what really happened; it may have been a complete inability to send or receive data, resulting in a timeout of some sort.
>> Take a look at the Ganglia graphs: free memory gone, a big spike in processes, a big drop in network activity!
> It was because of the CPU overload on the entire Apache cluster which occurred at that time. Secure and every other frontend proxy would have reported errors. Domas and I traced it back to job queue cache invalidations from an edit to [[Template:Reflist]] on the English Wikipedia.
> Note that the free memory isn't gone. RRDtool has the very unscientific practice of starting the vertical scale at something other than zero. Memory usage rose because processes use memory, and, as you noted, the number of processes increased: they were queueing, waiting for the overloaded backend cluster to serve them.
> -- Tim Starling
Interesting. Which part specifically do you think actually caused the extreme load? Having to re-parse a large number of pages as people view them? Did the issue show up from the invalidations before they were queued, or only after the jobs were run? And was this isolated to the secure servers, i.e., it didn't really affect the whole cluster but was only a problem because secure doesn't have as large a deployment as non-secure?
~Daniel Friesen (Dantman, Nadir-Seen-Fire) [http://daniel.friesen.name]
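The fan-out Tim describes can be sketched with a minimal model: an edit to a heavily transcluded template enqueues invalidation jobs covering every page that transcludes it, and each invalidated page must then be re-parsed on its next view. The names here (`JobQueue`, `invalidate_backlinks`, the batch size) are illustrative assumptions, not MediaWiki's actual API:

```python
from collections import deque

# Hypothetical minimal model of template-edit cache invalidation fan-out.
# Not MediaWiki code: class and function names are made up for illustration.

class JobQueue:
    """A trivial FIFO job queue."""
    def __init__(self):
        self.jobs = deque()

    def push(self, job):
        self.jobs.append(job)

def invalidate_backlinks(queue, template, backlinks, batch_size=100):
    """Enqueue one invalidation job per batch of pages transcluding `template`.

    Returns the total number of pages invalidated."""
    pages = backlinks[template]
    for i in range(0, len(pages), batch_size):
        queue.push(("htmlCacheUpdate", pages[i:i + batch_size]))
    return len(pages)

# A template transcluded on 10,000 pages (Reflist is on far more on enwiki).
backlinks = {"Template:Reflist": [f"Article_{n}" for n in range(10_000)]}
queue = JobQueue()
n = invalidate_backlinks(queue, "Template:Reflist", backlinks)
# One edit fans out to 100 jobs covering 10,000 pages; every one of those
# pages is re-parsed on its next view, which is the parse-load spike at issue.
```

The point of the sketch is the multiplier: a single edit is cheap, but the invalidation work scales with the number of transcluding pages, so one edit to a template used on most articles can swamp the parse capacity of the whole cluster.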