Hello all,
This is Asher's writeup of the jobqueue disruption that happened yesterday afternoon Pacific time.
He's not on this list, so please keep him in the cc: if you want him to see your message.
Greg
----- Forwarded message from Asher Feldman afeldman@wikimedia.org -----
Date: Fri, 29 Mar 2013 11:27:13 -0700
From: Asher Feldman afeldman@wikimedia.org
To: Operations Engineers ops@lists.wikimedia.org
Subject: [Ops] site issues yesterday - jobqueue and wikidata
We had two brief site disruptions yesterday: one in the afternoon that was fairly major but brief (12:40-12:43pm PST) and another, less severe, around 11pm. Both were jobqueue related; the wikidata change publisher was suspected as the trigger of the first incident, and the second incident points even more strongly in that direction.
As far as what happened - the current mysql jobqueue implementation is far too costly. In the last 24 hours, 75% of all queries taking over 450ms to run on the enwiki master were related to the jobqueue, and all major jobqueue actions result in replicated writes. Counting all queries, not just those over the slow-query threshold, the jobqueue accounts for 58% of total query execution time. If 1 million refreshLinks jobs are queued as quickly as possible without paying attention to replication lag, say hello to replication lag. MediaWiki depends on reading from slaves to scale and avoids lagged ones; if all slaves are lagged, the master is used for everything, and if that happens to enwiki, the site falls over.
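To illustrate the failure mode, here's a rough python sketch of the general logic (not MediaWiki's actual load balancer, which lives in PHP; the names and lag threshold are made up for the example):

    MAX_LAG_SECONDS = 5  # illustrative threshold, not the production value

    def pick_read_server(master, replicas, get_lag):
        """Return a replica with acceptable replication lag, else the master."""
        usable = [r for r in replicas if get_lag(r) < MAX_LAG_SECONDS]
        if usable:
            return min(usable, key=get_lag)  # least-lagged replica wins
        # Every replica is lagged: all reads pile onto the master and,
        # on enwiki, the site falls over.
        return master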
The wikidata change propagator inserts ChangeNotification jobs into local wiki queues in batches of 1000. The execution of one change job can result in many additional refreshLinks jobs being enqueued. Just prior to the meltdown, the wikidata propagator inserted around 7000 jobs into enwiki. That resulted in around 200k refreshLinks jobs being inserted in a single minute, and around 1.2 million over a slightly longer period. It turns out that trying to reparse 1/4 of enwiki as quickly as possible is a problem :)
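As a back-of-envelope check on those numbers (the job counts are the figures above; the average fan-out is just derived from them):

    change_jobs = 7_000               # ChangeNotification jobs inserted into enwiki
    refreshlinks_first_minute = 200_000
    refreshlinks_total = 1_200_000
    avg_fanout = refreshlinks_total / change_jobs  # ~170 refreshLinks jobs per change
    print(f"~{avg_fanout:.0f} refreshLinks jobs per ChangeNotification job")
    print(f"~{refreshlinks_first_minute:,} replicated job-table inserts on the master in one minute")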
Aaron deployed a change last night (https://gerrit.wikimedia.org/r/#/c/56572/1) that should throttle the insertion of new refreshLinks jobs when the queue is large, but we're not yet sure whether that's enough. We may also turn down the wikidata dispatcher batch size, shut down one of its two dispatchers, or again limit how many wikiadmin users can connect to the database to force a concurrency limit on everything jobqueue related.
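The shape of the throttle is roughly this (a sketch only; the real change is the PHP patch linked above, and the threshold and helper names here are hypothetical):

    QUEUE_BACKLOG_CAP = 100_000  # hypothetical "queue is large" threshold

    def maybe_enqueue_refreshlinks(queue, jobs):
        """Defer new refreshLinks insertions while the backlog is already large."""
        if queue.size() > QUEUE_BACKLOG_CAP:
            # Back off and let the runners drain the queue instead of
            # adding to the pile; the work gets picked up again later.
            return False
        queue.push_batch(jobs)
        return True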
The good news is that the mysql jobqueue was identified as a scaling bottleneck a while ago, and we will be switching to redis very soon. That work is currently targeted for the wmf13 release, but we may be able to backport it to wmf12 and get this done sooner.
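For a sense of why redis helps: pushes and pops against a redis list are cheap in-memory operations on a single server rather than replicated row writes on the enwiki master. A minimal sketch, assuming the redis-py client (this is not the upcoming MediaWiki implementation, just the general shape):

    import json
    import redis  # assumes the redis-py client is available

    r = redis.Redis(host="localhost", port=6379)

    def enqueue(job_type, params):
        # RPUSH is O(1) and never touches the mysql master.
        r.rpush(f"jobqueue:{job_type}", json.dumps(params))

    def dequeue(job_type, timeout=5):
        # BLPOP blocks until a job is available or the timeout expires.
        item = r.blpop(f"jobqueue:{job_type}", timeout=timeout)
        return json.loads(item[1]) if item else None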
In the interim, please do not release anything that will place new demands on the jobqueue, such as Echo, or any ramping up of wikidata.
-Asher
----- End forwarded message -----