From the Ops & Multimedia mailing lists:
We had a brief image scaler outage today at approximately 11:20 UTC.
It was investigated, and NYPL maps were found to be the cause.
As Gergo's recent, unanswered message in this thread suggested, we're actively working on a number of changes to stabilize GWToolset and improve image scaler performance in order to avoid such outages. Since everyone involved is participating in this thread, I assumed that you were waiting for these changes to land before restarting the GWToolset job that caused the previous outage a couple of weeks ago, or that you would warn us before running that job again. There seems to be a communication issue here.
By running this job, you've taken down thumbnail generation on Commons (and all WMF wikis), and we were lucky that someone from Ops was around, noticed it, and reacted quickly. This could easily have been avoided with better coordination: at a minimum, by scheduling a time for your next attempt so that people from Ops can watch the servers while the job runs.
Please make sure that this happens for the next batch of NYPL maps or other massive files that you plan to upload with GWToolset. All it takes is scheduling a day and time for the next upload attempt.
Gergo and I will keep replying to this thread to notify everyone when our related code changes are merged.