[Labs-l] Lag reporting on lab db replicas

Jaime Crespo jcrespo at wikimedia.org
Wed Nov 25 20:21:33 UTC 2015


Always afraid of running queries against a lagged replica on Labs? Not anymore!

While Betacommand's tool [0] was very useful, it was also very inaccurate,
as it tried to estimate the lag from the most recently updated rows, which
can be quite old on the least popular wikis.

What I offer now is lag measurement with sub-second accuracy: the current
time, in microseconds, is written on the production masters every 0.5
seconds and made available on all hosts (using this tool [1]). It is more
accurate than SHOW SLAVE STATUS because it measures the difference against
the original master, and it works even if replication is broken.
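
As a rough illustration of the idea (a sketch only, not the actual code of
the tool in [1]): the master records its own clock, that row replicates, and
the lag is the current time minus the newest replicated timestamp. The
function name heartbeat_lag and the exact timestamp layout are assumptions
made for this example.

```python
from datetime import datetime

# Assumed timestamp layout, e.g. "2015-11-25T20:20:32.000980" (UTC, microseconds).
HEARTBEAT_FORMAT = "%Y-%m-%dT%H:%M:%S.%f"


def heartbeat_lag(last_heartbeat: str, now: datetime) -> float:
    """Return seconds between the newest replicated heartbeat and `now`.

    `last_heartbeat` is the time the master wrote; `now` is the current
    UTC time when the replica is read. Both are treated as naive UTC values.
    """
    written_at = datetime.strptime(last_heartbeat, HEARTBEAT_FORMAT)
    # Clamp at zero: small clock differences could otherwise give a negative lag.
    return max(0.0, (now - written_at).total_seconds())


if __name__ == "__main__":
    # Hypothetical reading taken 2.5 seconds after the last heartbeat was written.
    now = datetime(2015, 11, 25, 20, 20, 34, 500980)
    print(heartbeat_lag("2015-11-25T20:20:32.000980", now))
```

Because the timestamp comes from the master itself, the result does not
depend on what the replica believes about its own replication state.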

To read it, just do SELECT * FROM heartbeat_p.heartbeat;
And you will get:
+-------+----------------------------+------+
| shard | last_updated               | lag  |
+-------+----------------------------+------+
| s6    | 2015-11-25T20:20:32.000980 |    0 |
| s2    | 2015-11-25T20:20:32.001030 |    0 |
| s7    | 2015-11-25T20:20:32.001070 |    0 |
| s3    | 2015-11-25T20:20:32.001000 |    0 |
| s4    | 2015-11-25T20:20:32.000920 |    0 |
| s1    | 2015-11-25T20:20:32.000740 |    0 |
| s5    | 2015-11-25T20:20:32.000830 |    0 |
+-------+----------------------------+------+
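
If you want to check the lag from a tool before running a heavy query,
something like the following might work. It is a minimal sketch: the
function name lagged_shards and the 5-second default threshold are my own
assumptions, and it only relies on the shard and lag columns shown above and
on a standard DB-API 2.0 cursor.

```python
def lagged_shards(cursor, max_lag_seconds=5):
    """Return {shard: lag} for every shard whose lag exceeds the threshold.

    `cursor` is assumed to be a DB-API 2.0 cursor connected to a Labs
    replica that exposes the heartbeat_p.heartbeat view.
    """
    cursor.execute("SELECT shard, lag FROM heartbeat_p.heartbeat")
    return {
        shard: lag
        for shard, lag in cursor.fetchall()
        # Skip rows without a lag value instead of guessing one.
        if lag is not None and lag > max_lag_seconds
    }
```

An empty result means no shard is over the threshold; otherwise a tool could
warn its users or postpone the query.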

Read the detailed documentation at [2].

Use it, and create a web page if you want to make it public! File a ticket
if the lag gets too high! File a ticket if you need more info (a record per
wiki?). I wanted to give you the essentials, and you can build on top of
them yourselves.

Only 2 known bugs:
- There is microsecond accuracy, but it cannot be used until a bug in
MariaDB is fixed [3]
- enwiki will only report s1 lag until that server is restarted due to some
existing filters. We will schedule that at some time in the future.

[0]<http://tools.wmflabs.org/betacommand-dev/cgi-bin/replag>
[1]<https://www.percona.com/doc/percona-toolkit/2.2/pt-heartbeat.html>
[2]<https://wikitech.wikimedia.org/wiki/Help:Tool_Labs/Database#Identifying_lag>
[3]<https://mariadb.atlassian.net/browse/MDEV-9175>
-- 
Jaime Crespo
<http://wikimedia.org>