-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1
Hi,
This mail describes two separate changes to how database access works. Please read all of it and make the necessary changes to your tools.
Summary =======
This is a brief summary of the changes you should make:
* If you currently use the fast server (sql-sX-fast or XXwiki-p.fastdb), change back to the normal server, then follow the rest of these instructions.
* For tools which only connect to 'sql', no changes are necessary.
* For tools which use database servers but *do not* use user databases:
* If you currently connect to sql-sX.toolserver.org, instead connect to sql-sX-rr.toolserver.org
* If you currently connect to XXwiki-p.db.toolserver.org, instead connect to XXwiki-p.rrdb.toolserver.org
* For tools which use database servers and *do* use user databases:
* If you currently connect to sql-sX.toolserver.org, instead connect to sql-sX-user.toolserver.org
* If you currently connect to XXwiki-p.db.toolserver.org, instead connect to XXwiki-p.userdb.toolserver.org
* If you have any queries which could run for longer than 10 minutes when working correctly, add the string SLOW_OK somewhere in the query, e.g.:
SELECT /* SLOW_OK */ * FROM table...
I will update the documentation on the wiki shortly to reflect these changes.
The rest of this mail describes the changes in detail.
RR servers ==========
A while ago, we introduced the idea of 'fast' database servers, which only allowed queries running for less than 60 seconds. The idea was that since there were no long queries to create load on the server, it would be less likely to have replication lag than the normal servers.
Since the introduction of fast servers, we have seen very little take-up of them, even for tools which could usefully use them. Additionally, we have not had much issue with replication lag recently.
We will therefore be retiring the fast servers, and replacing them with RR servers. To connect to an RR server, the following hostnames should be used:
* sql-sX-rr.toolserver.org * XXwiki-p.rrdb.toolserver.org
Unlike the normal aliases, the RR server will randomly connect you to one of the two servers in each cluster. (For example, when connecting to sql-s1-rr, you will randomly connect to either thyme or rosemary.)
There is no disadvantage for users to connect to the RR alias (since there is no limit on query execution time), and this will allow us to better distribute load among the database servers, which will reduce replication lag for everyone. It also makes it easier for us to add additional database servers to a cluster in the future.
The only tools which cannot use the RR servers are tools which access user databases, since these databases are still only present on a single server. These tools should instead connect to the new "user" aliases:
* sql-sX-user.toolserver.org * XXwiki-p.userdb.toolserver.org
The user aliases will always point to the server which currently contains the user databases.
Long query killer =================
To help prevent replication lag, we will be introducing an automatic query killer on all servers. This will work as follows:
* If the replication lag is under 10 minutes, no queries will be killed.
* If the replication lag is 10 minutes or more, but less than 30 minutes, queries will be killed if the following two conditions are both true:
1. The query does not contain the text SLOW_OK
2. The query has been running for X seconds or more, where X is the current replication lag.
* If the replication lag is 30 minutes or more, queries will be killed if the following condition is true:
1. The query has been running for X seconds or more, where X is the current replication lag.
This is intended to only kill queries which are causing replication lag (in particular, queries which cause InnoDB lock wait timeouts). We will monitor the performance of the query killer and might adjust the parameters in the future.
If you find that your queries are being killed and you don't think they should have been, please open a request in JIRA.
- river.