On Wed, Apr 6, 2011 at 5:40 PM, Jani Patokallio jpatokal@iki.fi wrote:
So I've got a small wiki cluster which doesn't have particularly high load, but where reliability is important. There is one MySQL DB master (load 0 = write only) and one slave (load 1 = read only).
By default, if MySQL's replication fails, MediaWiki doesn't seem to notice and users get confused since their writes go to master and are not reflected in what they read from the slave. However, the $wgDBservers help page says that there is a "max lag" parameter, defined as "Maximum replication lag before a slave will be taken out of rotation". [snip] If the slave now fails, does this mean that a) both reads and writes are sent to master, or that b) MediaWiki switches into read-only mode since there are no slaves left to handle writes? If the answer is "b", is there any way I can make scenario "a" happen?
It sounds like you mainly want to use replication to maintain a hot standby database, not for load balancing. It may actually be best here to not tell MediaWiki about the slave at all: use the master only, and consider using some other HA proxy or whatever to hot-swap the old slave in for the master if the master stops responding.
There's a little note in LoadBalancer::getRandomNonLagged() for the all-slaves-lagged case:
    # No appropriate DB servers except maybe the master and some slaves with zero load
    # Do NOT use the master
    # Instead, this function will return false, triggering read-only mode,
    # and a lagged slave will be used instead.
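To make the rule in that comment concrete, here is a toy sketch in Python. It is not the actual LoadBalancer code: the `Server` type, the `pick_reader` function and the 30-second default threshold are all made up for illustration, and the real function is more involved.

```python
import random
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Server:
    name: str
    load: int              # relative read weight; 0 means "send no reads here" (e.g. the master)
    lag: float             # current replication lag in seconds
    max_lag: float = 30.0  # hypothetical threshold, standing in for the "max lag" setting


def pick_reader(servers: List[Server]) -> Optional[Server]:
    """Pick a server for reads, or return None if every candidate is lagged.

    None models "return false, triggering read-only mode". The master
    (load 0) is deliberately never used as a fallback.
    """
    candidates = [s for s in servers if s.load > 0 and s.lag <= s.max_lag]
    if not candidates:
        return None
    # Weighted random choice among the non-lagged servers.
    weights = [s.load for s in candidates]
    return random.choices(candidates, weights=weights, k=1)[0]


if __name__ == "__main__":
    master = Server("master", load=0, lag=0.0)
    healthy = Server("slave", load=1, lag=5.0)
    lagged = Server("slave", load=1, lag=120.0)
    print(pick_reader([master, healthy]))  # the slave is chosen
    print(pick_reader([master, lagged]))   # None: read-only mode, not the master
```

With one master at load 0 and one slave at load 1, a lagged slave leaves no candidates at all, which is the "b" outcome from the question.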
Very early on we did try to fall back to master if no non-lagged slaves were available, however it can be highly problematic if you're using replication for the purpose of load balancing -- which is what MediaWiki's explicit support for multiple database servers is designed for. Under load, you end up with this sequence of events:
* load is relatively high, but read load is mostly spread over slaves
* some big operation comes through that causes slaves to lag
* requests keep coming in, but since slaves are lagged, all their read load goes to master instead
* master is now overloaded with all its own requests *plus* the requests that usually go to the slave servers, making both read/write and read-only operations slow or unavailable
* ... operations grind to a halt on your site ...
* slave servers finally catch up and start getting load again
* hopefully, master load decreases and the site recovers
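The arithmetic behind that sequence can be shown with a back-of-the-envelope model in Python. All the numbers and the `master_qps` function are invented for illustration; they are not measurements from any real site.

```python
def master_qps(write_qps: int, read_qps: int,
               slaves_lagged: bool, fall_back_to_master: bool) -> int:
    """Queries per second hitting the master in this toy model.

    Writes always go to the master. Reads only land on it when the
    slaves are lagged AND the policy is to fall back to the master.
    """
    if slaves_lagged and fall_back_to_master:
        return write_qps + read_qps
    return write_qps


# Made-up workload: mostly reads, and a master sized for its write load
# plus some headroom.
WRITES, READS, MASTER_CAPACITY = 50, 950, 300

if __name__ == "__main__":
    for policy in (True, False):
        qps = master_qps(WRITES, READS, slaves_lagged=True,
                         fall_back_to_master=policy)
        status = "overloaded" if qps > MASTER_CAPACITY else "ok"
        print(f"fall_back_to_master={policy}: master sees {qps} qps ({status})")
```

On a read-heavy site, falling back multiplies master load at exactly the moment the slaves are already struggling. On a small, low-load wiki the same sum may stay well under capacity, which is why the trade-off looks different for your case.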
If you wish to change this logic, I believe you could probably tweak that function, but I'd honestly recommend just letting MediaWiki talk to a single server for your case.
-- brion