From the Ops & Multimedia mailing lists:
We had a brief image scaler outage today at approx. 11:20 UTC. It was
investigated, and the NYPL maps were found to be the cause.
As Gergo's unanswered recent message in this thread suggested, we're
actively working on a number of changes to stabilize GWToolset and improve
image scaler performance in order to avoid such outages. I assumed that,
since everyone involved is participating in this thread, you were
waiting for these changes to happen before restarting the GWToolset job
that caused the previous outage a couple of weeks ago, or that you would
warn us before that job was run again. There seems to be a communication
issue here. By running this job, you've taken down thumbnail generation on
Commons (and all WMF wikis) and we were lucky that someone from Ops was
around, noticed it and reacted quickly. This could have been easily avoided
with better coordination, by at least scheduling a time to run your next
attempt, with people from Ops watching servers at the time the job is run.
Please make sure that this happens for the next batch of NYPL maps/massive
files that you plan to upload with GWToolset. All it takes is scheduling a
day and time for the next upload attempt.
Gergo and I will keep replying to this thread to notify everyone when our
related code changes are merged.
On Wed, May 7, 2014 at 10:26 PM, Gergo Tisza <gtisza(a)wikimedia.org> wrote:
Uhh... let's give this another shot in the morning.
I went through the past day's upload logs; on average there are ~600 uploads
an hour, the peak was 1900, and the slowest hour had around 240. (The numbers
are at http://pastebin.com/raw.php?i=wmBRJm1G in case anybody finds them
useful.) So that's around 4 files per minute in the worst case (240 / 60).
If we are aiming for no more than 10% of Special:NewFiles to be taken up
by GWToolset, that means 5 uploads per run of the control job (10% of the
50 slots at Special:NewFiles). The upload jobs can't really be throttled,
so we must make sure they come in small enough chunks, no matter how much
delay there is between the chunks. Also, we want to keep below 10% of the
total Commons upload rate: 10% of the slowest hour (~240 uploads) is 24
images per hour, which is roughly five runs of the control job per hour
(24 / 5 = 4.8).
So the correct config is
GWToolset\Config::$mediafile_job_throttle_default = 5;
$wgJobBackoffThrottling['gwtoolsetUploadMetadataJob'] = 5 / 3600;
I'm leaving the max throttle at 20 so that people who are uploading small,
non-TIFF images can get a somewhat higher speed.
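
For anyone who wants to re-check the numbers, the arithmetic behind the two
config values can be reproduced with a small standalone sketch. This is
plain Python, not MediaWiki or GWToolset code; the function and constant
names are hypothetical, and the inputs are only the figures quoted in this
message (240 uploads in the slowest hour, 50 slots on Special:NewFiles, a
10% ceiling).

```python
# Standalone arithmetic check; not part of MediaWiki or GWToolset.
# All names below are hypothetical, chosen only for this illustration.

MIN_UPLOADS_PER_HOUR = 240  # slowest hour in the sampled upload logs
NEWFILES_SLOTS = 50         # entries shown on Special:NewFiles
MAX_SHARE_PERCENT = 10      # GWToolset should take at most 10%


def derive_throttle(min_uploads_per_hour, newfiles_slots, max_share_percent):
    """Derive the per-run batch size and the control-job rate.

    Returns a dict with:
      uploads_per_run  - files one control-job run may enqueue
      hourly_budget    - GWToolset uploads allowed per hour
      runs_per_hour    - exact number of runs that fits the budget
      rounded_runs     - runs per hour rounded to the nearest integer
      runs_per_second  - the rounded rate expressed per second
    """
    # 10% of the Special:NewFiles slots bounds the size of one chunk.
    uploads_per_run = newfiles_slots * max_share_percent // 100
    # 10% of the slowest hourly upload rate bounds the hourly total.
    hourly_budget = min_uploads_per_hour * max_share_percent // 100
    runs_per_hour = hourly_budget / uploads_per_run
    rounded_runs = round(runs_per_hour)
    return {
        "uploads_per_run": uploads_per_run,
        "hourly_budget": hourly_budget,
        "runs_per_hour": runs_per_hour,
        "rounded_runs": rounded_runs,
        "runs_per_second": rounded_runs / 3600,
    }


result = derive_throttle(MIN_UPLOADS_PER_HOUR, NEWFILES_SLOTS, MAX_SHARE_PERCENT)
for name, value in result.items():
    print(f"{name}: {value}")
```

Note that rounding 4.8 runs up to 5 allows 25 uploads per hour, marginally
above the 24-per-hour budget, which matches the "roughly five runs" wording
above.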