I wrote:
My current thinking on replication lag is that lagged servers
shouldn't serve any read requests at all until they catch up. I've
said that it would be nice if this were implemented with creative
thread scheduling within MySQL, but we can get pretty close at the
application level. We can just send requests to any up-to-date
server, and if there are no up-to-date servers besides the master,
automatically switch the wiki to read only and serve error messages
for most reads. The approach Jamesday was recommending, i.e. making
do with lagged data wherever possible, just seems to exacerbate the
problem, because the read load extends lag times even further.
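As a rough sketch of that fallback, assuming the usual $wgReadOnly
switch (getRandomNonLaggingServer() is the same hypothetical load
balancer method used in the algorithm further down, and
showLagErrorPage() is a made-up stand-in for the explanatory error
page):

function getReadConnection() {
    global $wgLoadBalancer, $wgReadOnly;
    $server = $wgLoadBalancer->getRandomNonLaggingServer();
    if ( !$server ) {
        // Every slave is lagged: switch the wiki to read only and
        // serve an error message instead of stale data
        $wgReadOnly = 'The database slaves are catching up with the master';
        showLagErrorPage(); // hypothetical
        return false;
    }
    return $server;
}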
I should add that queueing would be an appropriate response to
cluster-wide lag times of up to a few seconds. The request could be
queued until a slave with zero lag becomes available. Serving an
error message only becomes appropriate when the expected queueing
time is long enough that the user would want an explanation.
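In code, the decision might look something like this, where
getMinimumLag() is a made-up helper returning the smallest slave lag
in seconds, and the three-second threshold is arbitrary:

$minLag = $wgLoadBalancer->getMinimumLag(); // hypothetical, seconds
if ( $minLag <= 3 ) {
    // Cluster-wide lag of a few seconds: queue the request until
    // a slave with zero lag becomes available
    queueRequest(); // hypothetical
} else {
    // Long enough that the user would want an explanation
    showLagErrorPage( $minLag );
}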
Of course then there's the tricky question of when and where to queue
requests. In ordinary operation, connection counts are a useful
indication of load, and limits on connection count can be used, for
example, to prevent apache from out-competing memcached for CPU time.
Queueing while waiting for an external resource inflates connection
counts, causing those limits to be hit prematurely, which encourages
sysadmins to set a compromise limit that may then be too high during
normal operation.
The other problem is that 500 apache threads waiting for the database
do not constitute a queue. It's more like the start line at a fun
run. When the gun goes off (i.e. a server reaches zero lag), the poor
slave will be swamped, and the replication thread will again be
out-competed until the batch of read requests is completed.
Perhaps the solution is a strict limit on database query concurrency.
Apache threads could call SHOW STATUS and wait until Threads_running
is sufficiently low. They could also use Threads_connected as an
estimate of the number of other apaches polling the same variable,
and scale their polling interval accordingly, to prevent swamping
when Threads_running finally drops low enough.
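The two status reads could be one-line wrappers around SHOW STATUS; a
sketch, where query() and fetchRow() stand in for whatever the
Database class actually provides (the status variable names
themselves are standard MySQL):

function threadsRunning( $db ) {
    $row = $db->fetchRow( $db->query( "SHOW STATUS LIKE 'Threads_running'" ) );
    return intval( $row[1] ); // $row is array( Variable_name, Value )
}

function threadsConnected( $db ) {
    $row = $db->fetchRow( $db->query( "SHOW STATUS LIKE 'Threads_connected'" ) );
    return intval( $row[1] );
}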
The algorithm would look something like this:
do {
    $nonLaggingServer = $wgLoadBalancer->getRandomNonLaggingServer();
    if ( $nonLaggingServer ) {
        $running = $nonLaggingServer->threadsRunning();
        $connected = $nonLaggingServer->threadsConnected();
        if ( $running > $maxRunning ) {
            // Back off in proportion to the number of other
            // clients likely to be polling the same server
            $sleepTime = 5 * $connected; // milliseconds
        } else {
            $sleepTime = 0;
        }
    } else {
        // No non-lagging server at all; poll slowly
        $sleepTime = 500; // milliseconds
    }
    usleep( $sleepTime * 1000 );
} while ( ( !$nonLaggingServer || $running > $maxRunning )
    && !$wgUser->lostPatience() );
Note that if a single overloaded server is non-lagging, the threads
won't get stuck waiting for it: the server is re-selected at random
on every iteration, so when a second non-lagging server becomes
available, roughly half of them will switch to waiting for that one
instead.
At the moment, I don't have any answers to the problem of controlling
apache load.
-- Tim Starling