Greetings,
So I've got a small wiki cluster which doesn't have particularly high load, but where reliability is important. There is one MySQL DB master (load 0 = write only) and one slave (load 1 = read only).
By default, if MySQL's replication fails, MediaWiki doesn't seem to notice and users get confused since their writes go to master and are not reflected in what they read from the slave. However, the $wgDBservers help page says that there is a "max lag" parameter, defined as "Maximum replication lag before a slave will be taken out of rotation". Assume I use it like this:
$wgDBservers = array(
    array( 'host' => "master.serv.er", 'load' => 0 ),
    array( 'host' => "slave1.serv.er", 'load' => 1, 'max lag' => 30 ),
);
If the slave now fails, does this mean that a) both reads and writes are sent to the master, or that b) MediaWiki switches into read-only mode since there are no slaves left to handle reads? If the answer is "b", is there any way I can make scenario "a" happen?
Cheers, -jani
On Wed, Apr 6, 2011 at 5:40 PM, Jani Patokallio jpatokal@iki.fi wrote:
> So I've got a small wiki cluster which doesn't have particularly high load, but where reliability is important. There is one MySQL DB master (load 0 = write only) and one slave (load 1 = read only).
> By default, if MySQL's replication fails, MediaWiki doesn't seem to notice and users get confused since their writes go to master and are not reflected in what they read from the slave. However, the $wgDBservers help page says that there is a "max lag" parameter, defined as "Maximum replication lag before a slave will be taken out of rotation". [snip] If the slave now fails, does this mean that a) both reads and writes are sent to the master, or that b) MediaWiki switches into read-only mode since there are no slaves left to handle reads? If the answer is "b", is there any way I can make scenario "a" happen?
It sounds like you mainly want to use replication to maintain a hot standby database, not for load balancing. It may actually be best here to not tell MediaWiki about the slave at all: use the master only, and consider using some other HA proxy or whatever to hot-swap the old slave in for the master if the master stops responding.
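A rough sketch of what that could look like with HAProxy sitting between MediaWiki and MySQL (hostnames, port and section name are placeholders, not a tested configuration):

```
listen mysql
    bind 0.0.0.0:3306
    mode tcp
    server master  master.serv.er:3306 check
    server standby slave1.serv.er:3306 check backup
```

MediaWiki would then point at the proxy as its single DB server; the `backup` keyword keeps the standby out of rotation until the master's health check fails.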
There's a little note in LoadBalancer::getRandomNonLagged() for the all-slaves-lagged case:
# No appropriate DB servers except maybe the master and some slaves with zero load
# Do NOT use the master
# Instead, this function will return false, triggering read-only mode,
# and a lagged slave will be used instead.
Very early on we did try to fall back to master if no non-lagged slaves were available, however it can be highly problematic if you're using replication for the purpose of load balancing -- which is what MediaWiki's explicit support for multiple database servers is designed for. Under load, you end up with this sequence of events:
* load is relatively high, but read load is mostly spread over slaves
* some big operation comes through that causes slaves to lag
* requests keep coming in, but since slaves are lagged, all their read load goes to master instead
* master is now overloaded with all its own requests *plus* the requests that usually go to the slave servers, making both read/write and read-only operations slow or unavailable
* ... operations grind to a halt on your site ...
* slave servers finally catch up and start getting load again
* hopefully, master load decreases and the site recovers
If you wish to change this logic, I believe you could probably tweak that function, but I'd honestly recommend just letting MediaWiki talk to a single server for your case.
-- brion
Thanks for the fast and authoritative reply!
Brion Vibber <brion at pobox.com> wrote:
> It sounds like you mainly want to use replication to maintain a hot standby database, not for load balancing. It may actually be best here to not tell MediaWiki about the slave at all: use the master only, and consider using some other HA proxy or whatever to hot-swap the old slave in for the master if the master stops responding.
So the slave becomes the master and starts accepting writes? Doesn't this imply that the direction of replication on the MySQL level has to be reversed after the former master comes back online, and now has to become the slave to get any changes that were made in the meantime? This sounds fairly painful, especially if the failure is intermittent. Alternatively, if you're saying that the slave should be read-only even when it takes over, then this isn't really much of an improvement on the current state of affairs.
> Very early on we did try to fall back to master if no non-lagged slaves were available, however it can be highly problematic if you're using replication for the purpose of load balancing -- which is what MediaWiki's explicit support for multiple database servers is designed for.
Problem is, while our system does occasionally get spikes of load usually involving heavy reads of swathes of the database (which is why we've got the master/slave split), the replication typically fails randomly for some reason other than load: network glitches, out of disk space, etc. So it's not that lag is high, it's that replication has failed entirely.
Anyway, I gather that the best thing to do is still set max lag, since at least this way the non-replicating slave switches MediaWiki into read-only mode and the user gets a clear failure message instead of just wondering why their edits seem to disappear into the ether.
Cheers, -jani
Jani Patokallio wrote:
> Thanks for the fast and authoritative reply!
> Brion Vibber <brion at pobox.com> wrote:
>> It sounds like you mainly want to use replication to maintain a hot standby database, not for load balancing. It may actually be best here to not tell MediaWiki about the slave at all: use the master only, and consider using some other HA proxy or whatever to hot-swap the old slave in for the master if the master stops responding.
> So the slave becomes the master and starts accepting writes? Doesn't this imply that the direction of replication on the MySQL level has to be reversed after the former master comes back online, and now has to become the slave to get any changes that were made in the meantime? This sounds fairly painful, especially if the failure is intermittent. Alternatively, if you're saying that the slave should be read-only even when it takes over, then this isn't really much of an improvement on the current state of affairs.
Yes. If you promote a slave to master, the old master then needs to be converted into a slave. Even worse, it is possible that some commits went into the old master but were never replicated to the slave (typical when the binlog disk gets full). The old master's data is then wrong and you need to reimport it.
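For the record, the role reversal on the MySQL side looks roughly like this (a sketch only; the replication user, password and binlog coordinates are placeholders, and it assumes the old master's data has first been verified or reimported):

```
-- On the promoted slave (the new master): stop applying and allow writes
STOP SLAVE;
RESET SLAVE;
SET GLOBAL read_only = OFF;

-- On the old master, once cleaned up, demote it to slave
-- (placeholder coordinates; use the new master's actual binlog position):
CHANGE MASTER TO
    MASTER_HOST = 'slave1.serv.er',
    MASTER_USER = 'repl',
    MASTER_PASSWORD = '...',
    MASTER_LOG_FILE = 'mysql-bin.000001',
    MASTER_LOG_POS = 4;
START SLAVE;
```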
>> Very early on we did try to fall back to master if no non-lagged slaves were available, however it can be highly problematic if you're using replication for the purpose of load balancing -- which is what MediaWiki's explicit support for multiple database servers is designed for.
> Problem is, while our system does occasionally get spikes of load usually involving heavy reads of swathes of the database (which is why we've got the master/slave split), the replication typically fails randomly for some reason other than load: network glitches, out of disk space, etc. So it's not that lag is high, it's that replication has failed entirely.
> Anyway, I gather that the best thing to do is still set max lag, since at least this way the non-replicating slave switches MediaWiki into read-only mode and the user gets a clear failure message instead of just wondering why their edits seem to disappear into the ether.
Yes, it seems that the solution Brion suggested is not the appropriate one for your setup.
mediawiki-l@lists.wikimedia.org