On Mon, Apr 21, 2014 at 8:04 AM, Ori Livneh <ori@wikimedia.org> wrote:
The number of Apache busy workers on the image scalers spiked between 2:55 and 3:15 UTC, peaking at about 3:12 and overwhelming rendering.svc.eqiad.wmnet for about a minute.
The outage correlates fairly well with a spike of fatals in TimedMediaHandler, consisting almost entirely of requests to this URL: http://commons.wikimedia.org/w/thumb_handler.php/2/2c/Closed_Friedmann_unive...
The full stack trace is included in <https://bugzilla.wikimedia.org/show_bug.cgi?id=64152>, filed by Reedy yesterday. It appears File::getMimeType is returning 'unknown/unknown' and that File::getHandler is consequently not able to find a handler.
The problem happened again this morning between 8:25 and 8:35 UTC. This time the load was so high that Ganglia stopped graphing data. From an analysis of the logs: while we do see a lot of fatals for the URL above, the number of requests for that URL is quite low and shows no spike in that interval. So the problem is genuine load, probably caused by some heavy processing.
The problem resolved before I could strace the Apache processes, so I don't have more details. Faidon was investigating as well and may have more info.
Giuseppe