On Tue, Apr 29, 2014 at 7:19 AM, dan entous d_entous@yahoo.com wrote:
the config values to most likely change would be $mediafile_job_throttle_default, which is currently set to 10 and $mediafile_job_throttle_max, which is currently set to 20.
at the moment, a user can set this throttle between 1-20. that means that every time a GWToolset Metadata background job is run between 1-20 GWToolset Mediafile jobs are added to the queue.
Just to verify that I am reading the code right:
- GWToolset has two main job types (plus a third for the final cleanup, which is not relevant here): UploadMetadataJob and UploadMediafileJob. The names are misleading: UploadMediafileJob actually does all the uploading, while UploadMetadataJob acts as a sort of daemon that spawns new UploadMediafileJob instances. Since the names are similar enough to mistake for each other when skimming, I'll just say controller job and worker job instead.
- When the user starts a batch upload, GWToolset creates a controller job. This job remains in existence until the upload ends. It handles delays by recreating itself: every time it is invoked, it reads some records from the XML, dispatches some workers, creates a clone of itself with an increased current-record counter, puts the clone at the end of the job queue, and exits. The clone is scheduled to run no sooner than now() + $metadata_job_delay (1 minute).
- Every time the controller runs, it creates N worker instances, where N is a user-supplied parameter limited by $mediafile_job_throttle_max (20).
- The controller does nothing (other than rescheduling itself) if there are more than $mediafile_job_queue_max (1000) workers queued. This is a global limit.
- Each worker instance handles the upload of a single file.
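To make sure we're talking about the same thing, here is a back-of-the-envelope model of the controller's dispatch loop as I understand it (function and variable names are mine, not GWToolset's; this ignores the queue-full backoff case):

```python
def controller_schedule(total_records, throttle):
    """Model of the controller job: return the worker-batch size dispatched
    on each controller run, one entry per run. Successive runs are spaced
    at least $metadata_job_delay apart."""
    batches = []
    record = 0  # the "current record counter" cloned into each reschedule
    while record < total_records:
        n = min(throttle, total_records - record)  # up to `throttle` workers per run
        batches.append(n)
        record += n
    return batches

# e.g. a 45-record XML with the default throttle of 10:
# controller_schedule(45, 10) -> [10, 10, 10, 10, 5]
```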
Is the above correct? If so, there are currently at most 20 file uploads per user per minute (fewer if the user was nice and chose a lower limit). Right now, the upload frequency for normal (manually uploaded) files on Commons is about 5 files per minute (a very unscientific number; I just looked at the recent changes page), so a full-speed GWToolset run would fill about 80% of Special:NewFiles. Pulling a number from thin air, we could cap GWToolset at 10% of the manual rate and see how well that works. That would mean one file per 2 minutes, and no more than 5 GWToolset files visible at a time (since Special:NewFiles shows 50 files). Concretely: set $mediafile_job_throttle_max to 10 and $metadata_job_delay to 20 minutes. (Since Dan said that in practice the controller job can take significantly longer than $metadata_job_delay to refresh itself, increasing $metadata_job_delay is preferable to decreasing $mediafile_job_throttle_max; otherwise we might end up with a much lower speed than we aimed for.)
This would take care of throttling a single user, but multiple users can still flood the uploads. We could set $mediafile_job_queue_max to the same value as $mediafile_job_throttle_max, which would turn it into an effective global limit, but GWToolset jobs are designed to die if they run into $mediafile_job_queue_max several times, so that would cause lots of failed uploads. As far as I can see, GWToolset has no settings to deal with this problem.
The other alternative is to ignore GWToolset's own throttling settings and use $wgJobBackoffThrottling, as Aaron suggested. That basically creates a global lock for 1 / $wgJobBackoffThrottling[$jobtype] minutes every time a job of the given type is run; so if we set it to 0.5, that would guarantee that GWToolset uploads never happen within two minutes of each other, regardless of any GWToolset settings. We would still want to change GWToolset's own settings so that it does not add jobs to the queue significantly faster than they can be processed; otherwise large uploads would error out.
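Using the reading above (the throttle value is a rate, so the minimum spacing between two jobs of that type is its reciprocal), the 0.5 setting works out as:

```python
def min_spacing_minutes(backoff_rate):
    """Minimum spacing between two runs of a job type, per the
    1 / $wgJobBackoffThrottling[$jobtype] description above."""
    return 1 / backoff_rate

# $wgJobBackoffThrottling['gwtoolsetUploadMediafileJob'] = 0.5
# -> min_spacing_minutes(0.5) == 2.0, i.e. one upload per two minutes globally
```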
So we would set
    \GWToolset\Config::$mediafile_job_throttle_max = 10;
    \GWToolset\Config::$metadata_job_delay = '20 minutes';
    $wgJobBackoffThrottling['gwtoolsetUploadMediafileJob'] = 0.5;
Does that sound reasonable?