TechCom IRC discussion tonight: Job Queue Issues - Wikitech-l

13 Sep 2017

Hi all!
This is a quick reminder that tonight, at the TechCom IRC hour, we will be
talking about the job queue. There have been several issues with it lately, and
we want to make sure that we have all relevant aspects on the radar.
As always, the discussion will take place in the IRC channel
#wikimedia-office on Wednesday 21:00 UTC (2pm PDT, 23:00 CEST).
This is not an RFC meeting, as there is no concrete proposal. Rather, it's an
opportunity to further our understanding of the problems at hand, and to float
ideas for possible improvements.
I have prepared a quick brain dump of my current understanding of the job queue
issues: https://www.mediawiki.org/wiki/User:Daniel_Kinzler_(WMDE)/Job_Queue
Here's a copy for your convenience, but please comment directly in the document.
Observations:
* Latest instance of the JQ exploding: https://phabricator.wikimedia.org/T173710
* With 600k jobs in the backlog of commonswiki, only 7k got processed in a day.
* For wikis with just a few thousand pages, we sometimes see millions of
UpdateHtmlCache jobs sitting in the queue.
* Jobs that were triggered months ago were found to still be failing and re-trying.
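
To put the commonswiki numbers in perspective, here is a rough back-of-the-envelope sketch. The function name and the enqueued_per_day parameter are made up for illustration, and the first call assumes that no new jobs arrive while the backlog drains, which is optimistic.

```python
def days_to_drain(backlog, processed_per_day, enqueued_per_day=0):
    """Estimate how long a backlog takes to clear at a constant rate.

    enqueued_per_day is a hypothetical inflow of new jobs; with the
    default of 0 the estimate is a best case.
    """
    net_rate = processed_per_day - enqueued_per_day
    if net_rate <= 0:
        # The queue never drains if jobs arrive as fast as they are run.
        return float("inf")
    return backlog / net_rate


# 600k backlog at 7k jobs per day: roughly 86 days, i.e. close to three months.
print(days_to_drain(600_000, 7_000))
```

Any steady inflow of new jobs only makes this worse, up to the point where the backlog never clears at all.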
Issues and considerations:
* Jobs re-trying indefinitely
* Deduplication
** Mechanism is obscure/undocumented. Some jobs rely on rootJob parameters, some
use custom logic.
** Batching prevents deduplication. When and how should jobs do batch
operations? Can we automatically break up small batches?
** Delaying jobs may improve deduplication, but support for delayed jobs is
limited/obscure.
** Custom coalescing could improve the chance for deduplication.
* Scope and purpose of some jobs is unclear. E.g. UpdateHtmlCache invalidates
the parser cache, and RefreshLinks re-parses the page - but does not trigger an
UpdateHtmlCache, which it probably should.
* The throttling mechanism does not take into account the nature and run-time of
different job types.
* Scaling is achieved by running more cron jobs.
* Kafka-based JQ is being tested by Services. Generally saner. Should improve
ability to track causality (which job got triggered by which other job). T157088
* No support for recurrent jobs. Should we keep using cron?
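
To make the deduplication and retry points a bit more concrete, here is a toy sketch. It is not how the MediaWiki job queue is implemented: the class, the signature scheme and the max_attempts cap are all made up for illustration, and the job name is used only as a label.

```python
import hashlib
import json


def job_signature(job_type, params):
    """Stable hash of a job's type and parameters (hypothetical scheme)."""
    payload = json.dumps([job_type, params], sort_keys=True)
    return hashlib.sha1(payload.encode("utf-8")).hexdigest()


class DedupQueue:
    """Toy in-memory queue: drops duplicates while pending, caps retries."""

    def __init__(self, max_attempts=3):
        self.max_attempts = max_attempts
        self.pending = {}  # signature -> (job_type, params, attempts)

    def push(self, job_type, params, attempts=0):
        sig = job_signature(job_type, params)
        if sig in self.pending:
            return False  # an identical job is already waiting
        self.pending[sig] = (job_type, params, attempts)
        return True

    def run_all(self, handler):
        """Run every pending job once; re-queue failures until the cap."""
        abandoned = []
        batch, self.pending = self.pending, {}
        for job_type, params, attempts in batch.values():
            try:
                handler(job_type, params)
            except Exception:
                if attempts + 1 < self.max_attempts:
                    self.push(job_type, params, attempts + 1)
                else:
                    abandoned.append((job_type, params))
        return abandoned


queue = DedupQueue(max_attempts=2)
queue.push("UpdateHtmlCache", {"page": "Foo"})
queue.push("UpdateHtmlCache", {"page": "Foo"})  # dropped as a duplicate
queue.push("UpdateHtmlCache", {"pages": ["Foo", "Bar"]})  # batch: not matched
print(len(queue.pending))
```

The batched job in the usage example has a different signature from the single-page job even though it covers the same page, which is the sense in which batching gets in the way of deduplication. Jobs that sit in the queue longer before running (e.g. delayed jobs) have more opportunity to be matched against later duplicates.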
-- 
Daniel Kinzler
Principal Platform Engineer

Wikimedia Deutschland
Gesellschaft zur Förderung Freien Wissens e.V.