Hello all,
This is Asher's writeup of the jobqueue disruption that happened
yesterday afternoon Pacific time.
He's not on this list, so please keep him in the cc: if you want him to
see your message.
Greg
----- Forwarded message from Asher Feldman <afeldman(a)wikimedia.org> -----
Date: Fri, 29 Mar 2013 11:27:13 -0700
From: Asher Feldman <afeldman(a)wikimedia.org>
To: Operations Engineers <ops(a)lists.wikimedia.org>
Subject: [Ops] site issues yesterday - jobqueue and wikidata
We had two site disruptions yesterday: one in the afternoon that was
fairly major but brief (12:40-12:43pm PDT), and a less severe one around
11pm. Both were jobqueue related; the wikidata change publisher was
suspected of triggering the first incident, and the second incident
points more strongly in that direction.
As for what happened: the current MySQL jobqueue implementation is far
too costly. Over the last 24 hours, 75% of all queries taking over 450ms
to run on the enwiki master were jobqueue related, and all major jobqueue
actions result in replicated writes. Jobqueue queries also account for
58% of total query execution time when counting all queries, not just
those over the slow threshold. If 1 million refreshLinks jobs are queued
as quickly as possible without paying attention to replication lag, say
hello to replication lag. MediaWiki depends on reading from slaves to
scale and avoids lagged ones; if all slaves are lagged, the master is
used for everything, and if this happens to enwiki, the site falls over.
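
To make the failure mode concrete, here is a minimal sketch of that
lag-aware read routing; the names and the threshold are illustrative,
not MediaWiki's actual load balancer code:

    # Minimal sketch of lag-aware read routing, as described above.
    # Names and the threshold are illustrative, not MediaWiki's actual
    # load balancer code.

    MAX_LAG_SECONDS = 5  # assumed "too lagged to read from" threshold

    class DbServer:
        def __init__(self, name, lag_seconds):
            self.name = name
            self.lag_seconds = lag_seconds  # cf. Seconds_Behind_Master

    def pick_read_db(master, slaves):
        """Route a read to the least-lagged slave, or the master if all are lagged."""
        usable = [s for s in slaves if s.lag_seconds < MAX_LAG_SECONDS]
        if not usable:
            # Every slave is lagged: all reads fall back to the master,
            # which is already saturated with jobqueue writes.
            return master
        return min(usable, key=lambda s: s.lag_seconds)

    # A flood of jobqueue writes lags every slave past the threshold,
    # so the master takes 100% of reads on top of the writes.
    master = DbServer("db-master", 0)
    slaves = [DbServer("db1", 30), DbServer("db2", 45)]
    assert pick_read_db(master, slaves) is master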
The wikidata change propagator inserts ChangeNotification jobs into local
wiki queues in batches of 1000. The execution of one change job can result
in many additional refreshLinks jobs being enqueued. Just prior to the
meltdown, the wikidata propagator inserted around 7000 jobs into enwiki.
That resulted in around 200k refreshLinks jobs getting inserted in a single
minute, and around 1.2 million over a slightly longer period. It turns out
that trying to reparse 1/4 of enwiki as quickly as possible is a problem :)
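
For a sense of the fan-out those numbers imply, simple arithmetic on the
figures above:

    # Back-of-the-envelope fan-out implied by the numbers above.
    change_jobs = 7000       # ChangeNotification jobs inserted into enwiki
    refresh_jobs = 1200000   # refreshLinks jobs ultimately enqueued
    print(refresh_jobs / change_jobs)  # ~171 refreshLinks jobs per change job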
Aaron deployed a change last night (
https://gerrit.wikimedia.org/r/#/c/56572/1) that should throttle the
insertion of new refreshLinks jobs when the queue is large, but we're not
yet sure if that's enough. We may also turn down the wikidata dispatcher
batch size, shut down one of its two dispatchers, or again limit the
number of wikiadmin connections to the database to force a concurrency
limit on all things jobqueue related.
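
For illustration, here is a minimal sketch of that kind of backlog
throttle; the cap and the queue API here are assumptions, and the real
change is in the gerrit link above:

    # Sketch of a queue-size throttle for refreshLinks insertion.
    # The cap and the JobQueue API are assumptions; see the gerrit
    # change above for the real implementation.

    QUEUE_SIZE_CAP = 100000  # hypothetical backlog cap

    class JobQueue:
        def __init__(self):
            self.jobs = []
        def size(self):
            return len(self.jobs)
        def push(self, job):
            self.jobs.append(job)

    def enqueue_refresh_links(queue, jobs):
        """Accept new refreshLinks jobs only while the backlog is below the cap."""
        room = QUEUE_SIZE_CAP - queue.size()
        if room <= 0:
            return 0  # backlog too large: let it drain before accepting more
        accepted = jobs[:room]
        for job in accepted:
            queue.push(job)
        return len(accepted)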
The good news is that the MySQL jobqueue was identified as a scaling
bottleneck a while ago, and we will be switching to Redis very soon. The
switch is currently targeted for the wmf13 release, but we may be able to
backport it to wmf12 and get this done sooner.
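
For contrast, a toy sketch of a Redis list-based queue using the redis-py
client (much simpler than a production implementation, which needs retries
and acknowledgement): pushes and pops are in-memory operations on one
server and generate no MySQL replication traffic.

    # Toy Redis list-based job queue using the redis-py client.
    # Much simpler than a real implementation (no retries, no ack),
    # but it shows why the pattern is cheap: RPUSH/BLPOP are O(1)
    # in-memory operations with no MySQL replication cost.
    import json
    import redis

    r = redis.StrictRedis(host="localhost", port=6379)
    KEY = "jobqueue:enwiki:refreshLinks"  # hypothetical key name

    def push_job(job):
        r.rpush(KEY, json.dumps(job))

    def pop_job(timeout=5):
        item = r.blpop(KEY, timeout=timeout)  # blocks up to `timeout` seconds
        return json.loads(item[1]) if item else None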
In the interim, please do not release anything that will place new demands
on the jobqueue, such as Echo, or any further ramping up of wikidata.
-Asher
----- End forwarded message -----
--
| Greg Grossmeier GPG: B2FA 27B1 F7EB D327 6B8E |
| identi.ca: @greg A18D 1138 8E47 FAC8 1C7D |