On Mon, Apr 21, 2014 at 8:04 AM, Ori Livneh <ori@wikimedia.org> wrote:

The number of Apache busy workers on the image scalers spiked between 2:55 and 3:15 UTC, peaking at about 3:12 and overwhelming rendering.svc.eqiad.wmnet for about a minute.

The outage correlates fairly well with a spike of fatals in TimedMediaHandler, consisting almost entirely of requests to this URL: <http://commons.wikimedia.org/w/thumb_handler.php/2/2c/Closed_Friedmann_universe_zero_Lambda.ogg/220px--Closed_Friedmann_universe_zero_Lambda.ogg.jpg>.

The full stack trace is included in <https://bugzilla.wikimedia.org/show_bug.cgi?id=64152>, filed by Reedy yesterday. It appears File::getMimeType is returning 'unknown/unknown' and that File::getHandler is consequently not able to find a handler.

The problem happened again this morning, between 8:25 and 8:35 UTC. This time the load was so high that Ganglia stopped graphing data. From an analysis of the logs: while we do have a lot of fatals for the URL above, the number of requests for that URL is quite low and shows no spike in that interval. So the fatals do not look like the cause; the problem is genuine load, probably caused by some heavy processing.
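For anyone who wants to repeat the check, here is a rough sketch of the kind of per-minute count I mean. It is not the exact procedure I ran: it assumes the standard Apache common/combined log format (timestamp in square brackets, request line in quotes), and the function name and the log path in the usage note are made up for illustration.

```python
import re
from collections import Counter

# Assumption: standard Apache common/combined log format, e.g.
#   10.0.0.1 - - [21/Apr/2014:08:25:13 +0000] "GET /path HTTP/1.1" 500 0
# The real log layout on the image scalers may differ.
LOG_RE = re.compile(r'\[(?P<ts>[^\]]+)\] "(?P<method>\S+) (?P<path>\S+)')


def requests_per_minute(lines, needle):
    """Count requests whose path contains `needle`, bucketed by minute."""
    counts = Counter()
    for line in lines:
        m = LOG_RE.search(line)
        if m is None or needle not in m.group("path"):
            continue
        # "21/Apr/2014:08:25:13 +0000" -> "21/Apr/2014:08:25"
        minute = m.group("ts")[:17]
        counts[minute] += 1
    return counts
```

Feeding it an open log file (for example `open("access.log")`, a placeholder path) together with `"Closed_Friedmann_universe_zero_Lambda.ogg"` gives one count per minute. A flat, low series across 8:25-8:35 is what rules out that URL as the source of the load.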

The problem resolved itself before I could strace the Apache processes, so I don't have more details. Faidon was investigating as well and may have more info.

Giuseppe