On Tue, Apr 29, 2014 at 7:19 AM, dan entous <d_entous(a)yahoo.com> wrote:
> the config values to most likely change would be
> $mediafile_job_throttle_default, which is currently set to 10 and
> $mediafile_job_throttle_max, which is currently set to 20.
> at the moment, a user can set this throttle between 1-20. that means that
> every time a GWToolset Metadata background job is run between 1-20
> GWToolset Mediafile jobs are added to the queue.
Just to verify that I am reading the code right:
- GWToolset has two main job types (plus a third for the final cleanup, but
that's not relevant here): UploadMetadataJob and UploadMediafileJob. The
names are misleading: UploadMediafileJob actually does all the uploading,
while UploadMetadataJob acts as a sort of daemon and spawns new
UploadMediafileJob instances. Since the names are similar enough to mistake
for each other when skimming, I'll just say controller job and worker job
instead.
- when the user starts the batch upload, GWToolset creates a controller
job. This job remains in existence until the upload ends. It handles
delays by recreating itself: every time it is invoked, it reads some
records from the XML, dispatches some workers, creates a clone of itself
with an advanced current-record counter, puts the clone at the end of the
job queue and exits. The clone is scheduled to run no sooner than now() +
$metadata_job_delay (1 minute).
- every time the controller runs, it creates N worker instances, where N is
a user-supplied parameter limited by $mediafile_job_throttle_max (20).
- the controller does not do anything (other than rescheduling itself) if
there are more than $mediafile_job_queue_max (1000) workers. This is a
global limit.
- every worker instance handles the upload of a single file.
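The controller/worker scheme above can be sketched as a simplified model (this is illustrative pseudo-structure, not GWToolset's actual code; all names and the single-threaded loop are my simplifications):

```python
# Simplified model of the controller/worker scheme described above.
# Names and structure are illustrative, not GWToolset's actual code.
from collections import deque

METADATA_JOB_DELAY = 60            # seconds between controller runs
MEDIAFILE_JOB_THROTTLE = 20        # workers spawned per controller run
MEDIAFILE_JOB_QUEUE_MAX = 1000     # global cap on queued workers

def run_controller(record, total_records, queue):
    """One controller invocation: dispatch workers, then 'reschedule'."""
    if len(queue) > MEDIAFILE_JOB_QUEUE_MAX:
        return record  # too many workers queued: reschedule, dispatch nothing
    batch = min(MEDIAFILE_JOB_THROTTLE, total_records - record)
    for i in range(batch):
        queue.append(("upload", record + i))  # one worker per file
    return record + batch  # the clone carries the advanced record counter

queue = deque()
record, total = 0, 45
while record < total:
    # in reality each iteration runs no sooner than METADATA_JOB_DELAY later
    record = run_controller(record, total, queue)

print(len(queue))  # 45 workers queued, dispatched in batches of 20, 20, 5
```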
Is the above correct? If so, that means there are currently at most 20
file uploads per user per minute (less if the user was nice and chose a
lower limit). Right now the upload frequency for normal (manually
uploaded) files on Commons is about 5 files per minute (a very
unscientific number; I just looked at the recent changes page), so a
full-speed GWToolset run would fill about 80% of Special:NewFiles. Pulling
a number from thin air, we could set this to 10% and see how well that
works - which would mean one file per 2 minutes, and no more than 5 at a
time (since Special:NewFiles shows 50 files). That would mean setting
$mediafile_job_throttle_max to 10 and $metadata_job_delay to 20 minutes.
(Since Dan said that in practice it can take significantly longer than
$metadata_job_delay for the controller job to refresh itself, increasing
$metadata_job_delay is preferable to decreasing
$mediafile_job_throttle_max; otherwise we might end up with a much lower
speed than we aimed for.)
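A quick back-of-envelope check of those numbers (the 5 uploads/minute manual rate and the 50-entry Special:NewFiles size are taken from the estimates above):

```python
# Back-of-envelope check of the throttle numbers discussed above.
manual_rate = 5.0    # estimated manual uploads per minute on Commons
newfiles_size = 50   # entries shown on Special:NewFiles

# Current worst case: 20 workers per 1-minute controller cycle.
current_rate = 20 / 1.0
print(current_rate / (current_rate + manual_rate))  # 0.8 -> ~80% of NewFiles

# Proposed: 10 workers per 20-minute cycle = 1 file per 2 minutes.
proposed_rate = 10 / 20.0
print(proposed_rate / manual_rate)  # 0.1 -> 10% of the manual rate

# NewFiles holds ~10 minutes of history at 5 manual uploads/minute,
# so at the proposed rate it contains about 5 GWToolset files at a time.
print(proposed_rate * newfiles_size / manual_rate)  # 5.0
```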
This would take care of throttling a single user, but multiple users could
still flood the uploads. We could set $mediafile_job_queue_max to the same
value as $mediafile_job_throttle_max, which would turn it into an
effective global limit, but GWToolset jobs are designed to die if they run
into $mediafile_job_queue_max several times, so that would cause lots of
failed uploads. As far as I can see, GWToolset has no setting to deal with
this problem.
The other alternative is to ignore GWToolset's own throttling settings and
use $wgJobBackoffThrottling as Aaron suggested. That basically creates a
global lock for 1 / $wgJobBackoffThrottling[$jobtype] minutes every time a
job of the given type is run; so if we set it to 0.5, that would guarantee
that GWToolset uploads never happen within two minutes of each other,
regardless of any GWToolset settings. We would still want to change those
settings to ensure that GWToolset does not send jobs to the queue
significantly faster than they can be processed, otherwise large uploads
would error out.
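To check that the proposed settings satisfy that constraint (under this email's reading of $wgJobBackoffThrottling as one job per 1/value minutes):

```python
# Sanity check: workers must not be enqueued faster than the backoff
# throttle lets them drain, or the queue grows without bound.
enqueue_rate = 10 / 20.0  # throttle_max=10 workers per 20-minute delay
drain_rate = 0.5          # backoff 0.5 -> one job per 1/0.5 = 2 minutes

print(enqueue_rate <= drain_rate)  # True: the queue stays bounded
```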
So we would set:

\GWToolset\Config::$mediafile_job_throttle_max = 10;
\GWToolset\Config::$metadata_job_delay = '20 minutes';
$wgJobBackoffThrottling['gwtoolsetUploadMediafileJob'] = 0.5;
Does that sound reasonable?