On Wed, Apr 6, 2011 at 5:40 PM, Jani Patokallio <jpatokal(a)iki.fi> wrote:
> So I've got a small wiki cluster which doesn't have particularly high
> load, but where reliability is important. There is one MySQL DB
> master (load 0 = write only) and one slave (load 1 = read only).
> By default, if MySQL's replication fails, MediaWiki doesn't seem to
> notice and users get confused since their writes go to master and are
> not reflected in what they read from the slave. However, the
> $wgDBservers help page says that there is a "max lag" parameter,
> defined as "Maximum replication lag before a slave will be taken out
> of rotation".

[snip]

> If slave now fails, does this mean that a) both reads and writes are
> sent to master, or that b) MediaWiki switches into read-only mode
> since there are no slaves left to handle writes? If the answer is
> "b", is there any way I can make scenario "a" happen?

It sounds like you mainly want to use replication to maintain a hot standby
database, not for load balancing. It may actually be best here to not tell
MediaWiki about the slave at all: use the master only, and put an HA proxy or
similar failover tool in front to swap the old slave in for the master
if the master stops responding.
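For the master-only setup, the relevant part of LocalSettings.php could look roughly like this. The hostname, database name and credentials are placeholders I made up; the only real point is that MediaWiki sees one address and the failover tool decides which machine answers on it.

```php
# LocalSettings.php (fragment). All values below are placeholders.
$wgDBtype     = 'mysql';
$wgDBserver   = 'db-vip.example.org'; # hypothetical address kept pointing at the live master
$wgDBname     = 'wikidb';
$wgDBuser     = 'wikiuser';
$wgDBpassword = 'secret';

# Leave $wgDBservers unset, so MediaWiki never learns about the standby
# and never tries to balance reads across it.
```

The standby keeps replicating from the master as before; it just isn't listed anywhere in the wiki's configuration.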
There's a little note in LoadBalancer::getRandomNonLagged() for the
all-slaves-lagged case:
# No appropriate DB servers except maybe the master and some slaves with zero load
# Do NOT use the master
# Instead, this function will return false, triggering read-only mode,
# and a lagged slave will be used instead.
Very early on we did try to fall back to master if no non-lagged slaves were
available. However, that can be highly problematic if you're using replication
for the purpose of load balancing -- which is what MediaWiki's explicit
support for multiple database servers is designed for. Under load, you end
up with this sequence of events:
* load is relatively high, but read load is mostly spread over slaves
* some big operation comes through that causes slaves to lag
* requests keep coming in, but since slaves are lagged, all their read load
goes to master instead
* master is now overloaded with all its own requests *plus* the requests
that usually go to the slave servers, making both read/write and read-only
operations slow or unavailable
* ... operations grind to a halt on your site ...
* slave servers finally catch up and start getting load again
* hopefully, master load decreases and the site recovers
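
To make the difference between the two policies concrete, here's a toy PHP sketch. This is not the actual LoadBalancer code: the function name, the array keys and the "pick the least-lagged slave" rule are all made up for illustration (as far as I recall, the real code picks at random, weighted by load).

```php
<?php
// Toy model only, NOT the real LoadBalancer logic. All names are hypothetical.
// $servers[0] is the master; every entry has 'lag' and 'maxLag' in seconds.

function pickReadServer(array $servers, bool $fallBackToMaster): array
{
    $usable = null; // least-lagged slave that is within its own lag limit
    $any    = null; // least-lagged slave overall, lagged or not

    foreach ($servers as $i => $server) {
        if ($i === 0) {
            continue; // the master is never a candidate in this loop
        }
        if ($any === null || $server['lag'] < $servers[$any]['lag']) {
            $any = $i;
        }
        if ($server['lag'] <= $server['maxLag']
            && ($usable === null || $server['lag'] < $servers[$usable]['lag'])
        ) {
            $usable = $i;
        }
    }

    if ($usable !== null) {
        // Normal case: a slave is fresh enough, read from it.
        return ['index' => $usable, 'readOnly' => false];
    }
    if ($fallBackToMaster || $any === null) {
        // Old behaviour (or no slaves configured at all): reads hit the master.
        return ['index' => 0, 'readOnly' => false];
    }
    // Current behaviour: keep reads off the master, use a lagged slave,
    // and flag the request as read-only.
    return ['index' => $any, 'readOnly' => true];
}
```

With one slave lagged past its limit, calling this with `$fallBackToMaster = false` gives you the lagged slave plus the read-only flag, while `true` sends the read to the master, which is exactly the step that starts the pile-up described above.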
If you wish to change this logic, I believe you could probably tweak that
function, but I'd honestly recommend just letting MediaWiki talk to a single
server for your case.
-- brion