My current thinking on replication lag is that lagged servers shouldn't
serve any read requests at all, until they catch up. I've said that it
would be nice if this was implemented with creative thread scheduling
within MySQL, but we can get pretty close at the application level. We
can just send requests to any up-to-date server, and if there are no
up-to-date servers besides the master, automatically switch the wiki to
read only and serve error messages for most reads. The approach Jamesday
was recommending, i.e. making do with lagged data wherever possible,
just seems to exacerbate the problem, because the read load extends lag
times.
To implement this we need a way for the apache threads to get lag
informaton. We can either:
* Give them the PROCESS privilege and use SHOW PROCESSLIST. I have code
for this already, it's used for maintenance scripts. This won't work
with all MySQL configurations but it will work for us. The security
risks are probably pretty small.
* Upgrade to MySQL 4.1. Not easy due to the need to dump and reimport,
and we don't need to do it for any other reason.
* Use "SELECT max(rc_timestamp)" like servmon. It wouldn't work well for
a site like wikicities but it should work for us most of the time.
* Use a daemon running as wikiadmin to pass information to the apache
threads via memcached. Maintenance and administration is an issue. A
variation on this is for the daemon to set load ratios based on lag
information.
Any thoughts? I'm favouring the first option.
-- Tim Starling