[Mediawiki-l] Sending DB reads to master if slave replication fails?

Thu Apr 7 01:02:01 UTC 2011

On Wed, Apr 6, 2011 at 5:40 PM, Jani Patokallio <jpatokal at iki.fi> wrote:

> So I've got a small wiki cluster which doesn't have particularly high
> load, but where reliability is important.  There is one MySQL DB
> master (load 0 = write only) and one slave (load 1 = read only).
>
> By default, if MySQL's replication fails, MediaWiki doesn't seem to
> notice and users get confused since their writes go to master and are
> not reflected in what they read from the slave.  However, the
> $wgDBservers help page says that there is a "max lag" parameter,
> defined as "Maximum replication lag before a slave will be taken out
> of rotation".
> [snip]
> If slave now fails, does this mean that a) both reads and writes are
> sent to master, or that b) MediaWiki switches into read-only mode
> since there are no slaves left to handle writes?  If the answer is
> "b", is there any way I can make scenario "a" happen?
>

It sounds like you mainly want to use replication to maintain a hot standby
database, not for load balancing. It may actually be best here to not tell
MediaWiki about the slave at all: use the master only, and consider using
an HA proxy or a similar failover mechanism to hot-swap the old slave in for
the master if the master stops responding.

There's a little note in LoadBalancer::getRandomNonLagged() for the
all-slaves-lagged case:

            # No appropriate DB servers except maybe the master and some slaves with zero load
            # Do NOT use the master
            # Instead, this function will return false, triggering read-only mode,
            # and a lagged slave will be used instead.
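
To make that comment concrete, here is a rough toy model of the behaviour it
describes. It is written in Python for brevity and is not MediaWiki's actual
code: the function names, the list-based server layout, the treatment of
broken replication as "lagged", and the choice of which lagged slave to use
are all assumptions made for illustration.

```python
import random


def pick_non_lagged(loads, lags, max_lag, rng=random):
    """Return the index of a weighted-random usable server, or None.

    loads[i] is the read weight of server i (index 0 is the master);
    lags[i] is its replication lag in seconds, or None if replication
    is broken (assumed here to count as lagged).
    """
    candidates = {
        i: load
        for i, load in enumerate(loads)
        if load > 0 and lags[i] is not None and lags[i] <= max_lag
    }
    if not candidates:
        # As in the quoted comment: do NOT fall back to the master.
        return None
    indexes = list(candidates)
    weights = [candidates[i] for i in indexes]
    return rng.choices(indexes, weights=weights, k=1)[0]


def choose_reader(loads, lags, max_lag):
    """Return (server_index, read_only) for a read request."""
    index = pick_non_lagged(loads, lags, max_lag)
    if index is not None:
        return index, False
    slaves = [i for i, load in enumerate(loads) if load > 0]
    if slaves:
        # Every slave is lagged: go read-only and read from one anyway.
        # Taking the first one is a simplification of this sketch.
        return slaves[0], True
    # Only the master is configured, so it serves everything.
    return 0, False
```

With the setup from the question (master load 0, slave load 1), a slave whose
lag exceeds the limit gives `choose_reader([0, 1], [0, 120], 30) == (1, True)`:
reads still go to the slave, and the wiki is flagged read-only.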

Very early on we did try falling back to the master when no non-lagged slaves
were available. That turns out to be highly problematic if you're using
replication for the purpose of load balancing -- which is what MediaWiki's
explicit support for multiple database servers is designed for. Under load,
you end up with this sequence of events:

* load is relatively high, but read load is mostly spread over slaves
* some big operation comes through that causes slaves to lag
* requests keep coming in, but since slaves are lagged, all their read load
goes to master instead
* master is now overloaded with all its own requests *plus* the requests
that usually go to the slave servers, making both read/write and read-only
operations slow or unavailable
* ... operations grind to a halt on your site ...
* slave servers finally catch up and start getting load again
* hopefully, master load decreases and the site recovers
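
The arithmetic behind that sequence can be sketched as a toy model in Python.
The function and its parameters are hypothetical, and it ignores everything
except raw query counts; it only shows why a master fallback multiplies the
master's load at exactly the moment the slaves drop out.

```python
def master_load(write_qps, read_qps, slaves_available, fall_back_to_master):
    """Queries per second reaching the master in this toy model."""
    if slaves_available or not fall_back_to_master:
        # Reads stay on the slaves (possibly stale ones), so the master
        # sees at most its usual write traffic.
        return write_qps
    # Slaves are lagged and reads fall back: the master gets everything.
    return write_qps + read_qps
```

For example, with made-up rates of 200 writes and 2000 reads per second,
`master_load(200, 2000, False, True)` is 2200, eleven times the master's
normal load, while the same call without fallback stays at 200.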

If you wish to change this logic, I believe you could probably tweak that
function, but I'd honestly recommend just letting MediaWiki talk to a single
server for your case.

-- brion