Always afraid of running queries against a lagged replica on Labs? Not anymore!
While Betacommand's tool [0] was very useful, it was also very inaccurate, as it tried to check the lag by looking at the last rows updated, which can be a long time ago on the least active wikis.
What I offer now is lag measurement with sub-second accuracy: the current time, in microseconds, is written on the production masters every 0.5 seconds and made available on all hosts (using this tool [1]). It is more accurate than SHOW SLAVE STATUS because it measures the difference against the original master, and it works even if replication is broken.
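The idea behind the measurement can be sketched in a few lines of Python. This is only an illustration, not how pt-heartbeat or the heartbeat_p view computes its lag column: the function name `heartbeat_lag` is hypothetical, the timestamp format is taken from the sample output in this message, and both timestamps are assumed to be in the same time zone on synchronized clocks.

```python
from datetime import datetime

# Format of the last_updated column as it appears in the sample output.
HEARTBEAT_FORMAT = "%Y-%m-%dT%H:%M:%S.%f"


def heartbeat_lag(last_updated: str, now: datetime) -> float:
    """Seconds elapsed between the master's last heartbeat and `now`.

    Hypothetical helper: the replica is lagged by roughly the age of the
    newest heartbeat row it has received. Clamped at zero in case of
    small clock differences.
    """
    written = datetime.strptime(last_updated, HEARTBEAT_FORMAT)
    return max(0.0, (now - written).total_seconds())


# A heartbeat written about half a second before "now".
now = datetime(2015, 11, 25, 20, 20, 32, 500000)
print(heartbeat_lag("2015-11-25T20:20:32.000980", now))
```

Because the timestamp originates on the master, the result does not depend on the replica's own view of its replication state.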
To read it, just do
SELECT * FROM heartbeat_p.heartbeat;
And you will get:
+-------+----------------------------+------+
| shard | last_updated               | lag  |
+-------+----------------------------+------+
| s6    | 2015-11-25T20:20:32.000980 |    0 |
| s2    | 2015-11-25T20:20:32.001030 |    0 |
| s7    | 2015-11-25T20:20:32.001070 |    0 |
| s3    | 2015-11-25T20:20:32.001000 |    0 |
| s4    | 2015-11-25T20:20:32.000920 |    0 |
| s1    | 2015-11-25T20:20:32.000740 |    0 |
| s5    | 2015-11-25T20:20:32.000830 |    0 |
+-------+----------------------------+------+
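A tool that reads this table might want to flag shards whose lag exceeds some limit before trusting its results. The following is a minimal sketch under stated assumptions: the rows are plain (shard, last_updated, lag) tuples as a database driver could return them, `lagged_shards` and the 60-second default are hypothetical, and the s3 value below is made up for illustration (the real sample shows 0 everywhere).

```python
from typing import Iterable, List, Tuple

# (shard, last_updated, lag in seconds), mirroring the three columns above.
Row = Tuple[str, str, int]


def lagged_shards(rows: Iterable[Row], max_lag: int = 60) -> List[str]:
    """Return the names of shards whose lag is above max_lag, sorted."""
    return sorted(shard for shard, _last_updated, lag in rows if lag > max_lag)


# Illustrative rows only; the lag of 125 for s3 is invented for the example.
rows = [
    ("s6", "2015-11-25T20:20:32.000980", 0),
    ("s3", "2015-11-25T20:18:27.001000", 125),
    ("s1", "2015-11-25T20:20:32.000740", 0),
]
print(lagged_shards(rows))
```

The threshold is a judgment call for each tool; a report that tolerates stale data can use a much larger value than an interactive one.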
Read the detailed documentation at [2].
Use it, and create a web page if you want to make it public! File a ticket if the lag gets too high, or if you need more information (a record per wiki?). I wanted to give you the essentials, and you can build on top of that yourselves.
Only 2 known bugs:
- Microsecond accuracy is available, but it cannot be used until a bug in MariaDB is fixed [3].
- Due to some existing filters, enwiki will only report s1 lag until that server is restarted. We will schedule that at some time in the future.
[0] http://tools.wmflabs.org/betacommand-dev/cgi-bin/replag
[1] https://www.percona.com/doc/percona-toolkit/2.2/pt-heartbeat.html
[2] https://wikitech.wikimedia.org/wiki/Help:Tool_Labs/Database#Identifying_lag%...
[3] https://mariadb.atlassian.net/browse/MDEV-9175
On Wed, Nov 25, 2015 at 1:21 PM, Jaime Crespo jcrespo@wikimedia.org wrote:
Always afraid of running queries against a lagged replica on Labs? Not anymore!
While Betacommand's tool [0] was very useful, it was also very inaccurate, as it tried to check the lag by looking at the last rows updated, which can be a long time ago on the least active wikis.
What I offer now is lag measurement with sub-second accuracy: the current time, in microseconds, is written on the production masters every 0.5 seconds and made available on all hosts (using this tool [1]). It is more accurate than SHOW SLAVE STATUS because it measures the difference against the original master, and it works even if replication is broken.
To read it, just do
SELECT * FROM heartbeat_p.heartbeat;
And you will get:
+-------+----------------------------+------+
| shard | last_updated               | lag  |
+-------+----------------------------+------+
| s6    | 2015-11-25T20:20:32.000980 |    0 |
| s2    | 2015-11-25T20:20:32.001030 |    0 |
| s7    | 2015-11-25T20:20:32.001070 |    0 |
| s3    | 2015-11-25T20:20:32.001000 |    0 |
| s4    | 2015-11-25T20:20:32.000920 |    0 |
| s1    | 2015-11-25T20:20:32.000740 |    0 |
| s5    | 2015-11-25T20:20:32.000830 |    0 |
+-------+----------------------------+------+
Read the detailed documentation at [2].
Use it, and create a web page if you want to make it public! File a ticket if the lag gets too high, or if you need more information (a record per wiki?). I wanted to give you the essentials, and you can build on top of that yourselves.
Only 2 known bugs:
- Microsecond accuracy is available, but it cannot be used until a bug in
MariaDB is fixed [3].
- Due to some existing filters, enwiki will only report s1 lag until that
server is restarted. We will schedule that at some time in the future.
[0] http://tools.wmflabs.org/betacommand-dev/cgi-bin/replag
[1] https://www.percona.com/doc/percona-toolkit/2.2/pt-heartbeat.html
[2] https://wikitech.wikimedia.org/wiki/Help:Tool_Labs/Database#Identifying_lag
[3] https://mariadb.atlassian.net/browse/MDEV-9175
I made a tool [4] that reads the heartbeat_p database from the server that hosts each shard and matches it with the shard for each wiki. The tool gets all (dbname, slice) pairs from meta_p.wiki and the slice replag from heartbeat_p.heartbeat on the server hosting each slice, and then matches them up in the table. I think I got the logic here right, but you can view the source [5] to see if you agree.
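The matching step described above can be sketched roughly as follows. This is not the tool's actual source (see [5] for that). It is a simplified illustration that takes the heartbeat values as one dictionary instead of querying each slice's server, and it assumes that slice names may carry a host suffix such as "s1.labsdb" which has to be stripped before the lookup. The function name `lag_per_wiki` and the sample pairs are hypothetical.

```python
from typing import Dict, Iterable, Optional, Tuple


def lag_per_wiki(
    wikis: Iterable[Tuple[str, str]],
    shard_lag: Dict[str, int],
) -> Dict[str, Optional[int]]:
    """Map each wiki dbname to the lag of the shard it lives on.

    wikis:     (dbname, slice) pairs, as they might come from meta_p.wiki.
    shard_lag: shard name -> lag in seconds, as read from heartbeat_p.heartbeat.
    A wiki whose shard has no heartbeat row maps to None.
    """
    result: Dict[str, Optional[int]] = {}
    for dbname, slice_name in wikis:
        # Assumption: "s1.labsdb" and "s1" both refer to shard "s1".
        shard = slice_name.split(".", 1)[0]
        result[dbname] = shard_lag.get(shard)
    return result


# Illustrative input only; values are not taken from a real server.
wikis = [("enwiki", "s1.labsdb"), ("dewiki", "s5.labsdb"), ("somewiki", "s9.labsdb")]
shard_lag = {"s1": 0, "s5": 3}
print(lag_per_wiki(wikis, shard_lag))
```

Returning None for an unknown shard keeps a missing heartbeat row visible instead of silently reporting zero lag.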
[4]: https://tools.wmflabs.org/replag/
[5]: https://tools.wmflabs.org/replag/?source
Bryan
On Thu, 26 Nov 2015 at 22:16 -0700, Bryan Davis wrote:
I made a tool [4] that reads the heartbeat_p database from the server that hosts each shard and matches it with the shard for each wiki. The tool gets all (dbname, slice) pairs from meta_p.wiki and the slice replag from heartbeat_p.heartbeat on the server hosting each slice, and then matches them up in the table. I think I got the logic here right, but you can view the source [5] to see if you agree.
Bryan
Could you add a new set of three columns showing the lag per slice? That would give a more concise view than the per-wiki table.