Hi all!
This is a quick reminder that tonight, at the TechCom IRC hour, we will be talking about the job queue. There have been several issues with it lately, and we want to make sure that we have all relevant aspects on the radar.
As always, the discussion will take place in the IRC channel #wikimedia-office on Wednesday 21:00 UTC (2pm PDT, 23:00 CEST).
This is not an RFC meeting, as there is no concrete proposal. Rather, it's an opportunity to further our understanding of the problems at hand, and to float ideas for possible improvements.
I have prepared a quick brain dump of my current understanding of the job queue issues https://www.mediawiki.org/wiki/User:Daniel_Kinzler_(WMDE)/Job_Queue. Here's a copy for your convenience, but please comment directly in the document.
Observations:
* Latest instance of the JQ exploding: https://phabricator.wikimedia.org/T173710
* With 600k jobs in the backlog of commonswiki, only 7k got processed in a day.
* For wikis with just a few thousand pages, we sometimes see millions of UpdateHtmlCache jobs sitting in the queue.
* Jobs that were triggered months ago were found to still be failing and re-trying.
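For a sense of scale, here is a back-of-the-envelope calculation for the commonswiki numbers above. It assumes (unrealistically) that no new jobs arrive and that the processing rate stays at the observed 7k per day; the variable names are just for illustration.

```python
# Rough drain time for the commonswiki backlog, using the figures above.
# Assumption: no new jobs are enqueued and the rate stays constant.
backlog_jobs = 600_000
processed_per_day = 7_000

days_to_drain = backlog_jobs / processed_per_day
print(f"about {days_to_drain:.0f} days to drain the backlog")
```

In other words, at that rate the existing backlog alone would take close to three months to clear.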
Issues and considerations:
* Jobs re-trying indefinitely
* Deduplication
** Mechanism is obscure/undocumented. Some rely on rootJob parameters, some use custom logic.
** Batching prevents deduplication. When and how should jobs do batch operations? Can we automatically break up small batches?
** Delaying jobs may improve deduplication, but support for delayed jobs is limited/obscure.
** Custom coalescing could improve the chance for deduplication.
* Scope and purpose of some jobs are unclear. E.g. UpdateHtmlCache invalidates the parser cache, and RefreshLinks re-parses the page but does not trigger an UpdateHtmlCache, which it probably should.
* The throttling mechanism does not take into account the nature and run-time of different job types.
* Scaling is achieved by running more cron jobs.
* Kafka-based JQ is being tested by Services. Generally saner. Should improve ability to track causality (which job got triggered by which other job). T157088
* No support for recurrent jobs. Should we keep using cron?
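To make the points about indefinite retries and deduplication a bit more concrete, here is a minimal sketch in Python. This is not MediaWiki code (the real job queue is PHP), and every name in it (`Job`, `SketchQueue`, `dedup_key`, `max_attempts`) is hypothetical. It assumes that two jobs are duplicates when their type and parameters are identical, and that a job should be set aside after a fixed number of failed attempts instead of being retried forever.

```python
import hashlib
import json
from collections import deque
from dataclasses import dataclass


@dataclass
class Job:
    job_type: str
    params: dict
    attempts: int = 0

    def dedup_key(self) -> str:
        # Assumed rule: same type and same parameters means "duplicate".
        payload = json.dumps([self.job_type, self.params], sort_keys=True)
        return hashlib.sha1(payload.encode("utf-8")).hexdigest()


class SketchQueue:
    """Toy in-memory queue with deduplication and a retry cap."""

    def __init__(self, max_attempts: int = 3):
        self.max_attempts = max_attempts
        self.pending = deque()
        self.pending_keys = set()
        self.abandoned = []  # jobs that exhausted their attempts

    def push(self, job: Job) -> bool:
        """Enqueue a job; return False if an identical job is already pending."""
        key = job.dedup_key()
        if key in self.pending_keys:
            return False
        self.pending_keys.add(key)
        self.pending.append(job)
        return True

    def run_one(self, handler) -> None:
        """Run the oldest pending job; re-queue on failure up to the cap."""
        if not self.pending:
            return
        job = self.pending.popleft()
        self.pending_keys.discard(job.dedup_key())
        job.attempts += 1
        try:
            handler(job)
        except Exception:
            if job.attempts >= self.max_attempts:
                self.abandoned.append(job)  # give up instead of looping forever
            else:
                self.push(job)
```

With this sketch, pushing the same UpdateHtmlCache-style job twice leaves only one entry in the queue, and a job whose handler always fails ends up in `abandoned` after `max_attempts` runs. Whether a fixed cap and exact-match deduplication are the right policies for the real queue is exactly the kind of question for the discussion.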
wikitech-l@lists.wikimedia.org