[Labs-l] Launching jobs, any limit?

Tim Landscheidt tim at tim-landscheidt.de
Mon May 26 18:45:05 UTC 2014


Maximilian Doerr <cybernet678 at yahoo.com> wrote in a slightly
different order:

>>> These days I'm processing Wikipedia dumps. Today I tried English Wikipedia,
>>> which is in 150+ chunks (pages-meta-history*.7z).

>>> I have a bash script that launches the jsub jobs, one job per chunk, so I
>>> queued more than 150 jobs. After that, I saw that 95 of them had
>>> started and were spread across the execution nodes.

>>> I saw the load on some of the nodes reach 250%; is this normal? I
>>> stopped all of them because I'm not sure whether I have to launch small
>>> batches, 10 at a time or so, or whether it is OK to launch them all and
>>> ignore the CPU load of the execution nodes.
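For illustration, the one-job-per-chunk submission described above could be sketched roughly like this; the chunk file names, job names, and the processing script are assumptions, not taken from the thread, and echo is used as a dry run:

```shell
#!/bin/bash
# Hedged sketch: submit one grid job per dump chunk.
# "process-chunk.sh" is hypothetical; drop the echo to actually submit.
shopt -s nullglob

submit() {
    # jsub -N sets the job name on the grid.
    echo jsub -N "$1" ./process-chunk.sh "$2"
}

for chunk in enwiki-pages-meta-history*.7z; do
    # Derive a job name from the chunk file name.
    submit "dump-${chunk%.7z}" "$chunk"
done
```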

>> The grid should keep the average load below 1, but that is
>> its job, not yours :-).  So launching 150 jobs is totally
>> fine.  If you see a load of more than 100 % for a prolonged
>> time, notifying an admin doesn't hurt, but due to the nature
>> of the system -- the grid can only guess what the /future/
>> load of a job will be -- outliers are to be expected.

> Wait.  The grid should have a limit of 15.  I've hit that limit so many times, I received my own exec node.

No, the grid should have no limit for the number of jobs
submitted, but limit the number of jobs executed in parallel
per user.  Apparently, the latter got lost during the
migration from pmtpa to eqiad.  I've filed
https://bugzilla.wikimedia.org/65777 for that.
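For context, a per-user concurrency cap of this kind is typically expressed in (Sun/Open) Grid Engine as a resource quota set (editable via qconf -arqs/-mrqs); a sketch, where the rule name and the limit of 15 are illustrative, not the actual Tools configuration:

```
{
   name         per_user_slot_limit
   description  "Cap concurrently used slots per user"
   enabled      TRUE
   limit        users {*} to slots=15
}
```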

Tim



