[Labs-l] Lag reporting on lab db replicas

Jaime Crespo jcrespo at wikimedia.org
Thu Nov 26 18:03:36 UTC 2015


> How about doing this per wiki?

That is as easy as joining this with the meta tables. I challenge all of
you to see who can make it faster :-)
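
To make the idea concrete, here is a minimal sketch of that join done
client-side. It is only an illustration: the row shapes are assumptions (a
hypothetical "shard" field per wiki), the sample values are made up, and the
real meta tables may name things differently.

```python
# Hypothetical sample rows. On the replicas, the first list would come from
# heartbeat_p.heartbeat and the second from a meta table mapping wikis to shards.
heartbeat_rows = [
    {"shard": "s1", "lag": 0},
    {"shard": "s3", "lag": 2},
]
wiki_rows = [  # assumed shape: one row per wiki, with the shard hosting it
    {"dbname": "enwiki", "shard": "s1"},
    {"dbname": "jaimewiki", "shard": "s3"},
]


def lag_per_wiki(heartbeat_rows, wiki_rows):
    """Return {dbname: lag}, or None for a wiki whose shard has no heartbeat row."""
    lag_by_shard = {row["shard"]: row["lag"] for row in heartbeat_rows}
    return {wiki["dbname"]: lag_by_shard.get(wiki["shard"]) for wiki in wiki_rows}


for dbname, lag in sorted(lag_per_wiki(heartbeat_rows, wiki_rows).items()):
    print(dbname, lag)
```

The same thing can of course be written as a single SQL join on the replica,
which avoids pulling both tables to the client.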

On Thu, Nov 26, 2015 at 6:49 PM, Yetkin Sakal <superyetkin at yahoo.com> wrote:

> How about doing this per wiki?
>
> On Thursday, November 26, 2015 10:19 AM, Jaime Crespo <
> jcrespo at wikimedia.org> wrote:
>
>
> > So even if the replicas don't get updated the heartbeat will report them
> > as up to date?
>
> Not sure exactly what you mean by that. The masters will be updated
> continuously every 0.5 seconds (all slaves are read-only; no writes are
> done there). If replication works and the slaves get updated, they will
> receive the heartbeat through the same replication channel as the rest of
> the updates. If replication doesn't work and the replicas do not get
> updated, they will not receive the heartbeat either, as it arrives through
> replication, in order. If replication stops or fails, heartbeat updates
> will stop (from the slave's perspective), and lag will start to increase
> from your perspective (difference between last timestamp written and
> current time).
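
The "difference between last timestamp written and current time" can be
sketched as follows. This is a rough illustration, not how the lag column is
actually produced: the function name is hypothetical, and it assumes the
heartbeat timestamps are in UTC and use the format shown in the sample output.

```python
from datetime import datetime, timezone


def replication_lag_seconds(last_updated, now=None):
    """Seconds elapsed between the last heartbeat seen on a replica and `now`."""
    # Assumption: timestamps look like 2015-11-25T20:20:32.000980 and are UTC.
    written = datetime.strptime(last_updated, "%Y-%m-%dT%H:%M:%S.%f")
    written = written.replace(tzinfo=timezone.utc)
    if now is None:
        now = datetime.now(timezone.utc)
    # Clamp at zero so a small clock skew never shows up as negative lag.
    return max(0.0, (now - written).total_seconds())


# If replication stopped right after this heartbeat, five minutes later the
# replica still holds the same timestamp, so the computed lag keeps growing.
stalled_at = datetime(2015, 11, 25, 20, 25, 32, 980, tzinfo=timezone.utc)
print(replication_lag_seconds("2015-11-25T20:20:32.000980", now=stalled_at))  # 300.0
```
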
>
> This measures the replication lag (i.e. the difference with the master), not
> the last time an edit was done by a user, which was what the first link I
> sent measured. In other words, if jaimewiki receives user edits only once
> an hour, heartbeat will still write to its master every half second,
> thus proving that it is up to date with that resolution. You can still
> check the last user edit by checking recentchanges.
>
> The only reason this could fail (heartbeat updated but wiki not) is if
> there was a specific filter denying replication but allowing heartbeat,
> which is only done for specific tables and private wikis. Also, the
> production master could have a problem, but that would affect the wikis
> themselves, not only labs.
>
> To give you an idea of the accuracy of this method, we (will) use it in
> production to decide whether a slave is usable for returning up-to-date data.
>
> For more information on how this works, check
> <https://www.percona.com/doc/percona-toolkit/2.1/pt-heartbeat.html#description>
>
> On Wed, Nov 25, 2015 at 9:51 PM, Ricordisamoa <
> ricordisamoa at openmailbox.org> wrote:
>
> On 25/11/2015 21:21, Jaime Crespo wrote:
>
> Always afraid of running queries on a lagged replica on labs? Not anymore!
>
> While Betacommand's tool [0] was very useful, it was also very inaccurate,
> as it tried to check the lag by looking at the last rows updated, which can
> be a long time ago on the least popular wikis.
>
> What I offer now is lag measurement with sub-second accuracy, by writing the
> current time, in microseconds, on the production masters every 0.5 seconds
> and making that available on all hosts (using this tool [1]). So, it is more
> accurate than SHOW SLAVE STATUS, because it compares the difference with
> the original master, and it will work even if replication is broken.
>
>
> So even if the replicas don't get updated the heartbeat will report them
> as up to date?
>
>
> To read it, just do SELECT * FROM heartbeat_p.heartbeat;
> And you will get:
> +-------+----------------------------+------+
> | shard | last_updated               | lag  |
> +-------+----------------------------+------+
> | s6    | 2015-11-25T20:20:32.000980 |    0 |
> | s2    | 2015-11-25T20:20:32.001030 |    0 |
> | s7    | 2015-11-25T20:20:32.001070 |    0 |
> | s3    | 2015-11-25T20:20:32.001000 |    0 |
> | s4    | 2015-11-25T20:20:32.000920 |    0 |
> | s1    | 2015-11-25T20:20:32.000740 |    0 |
> | s5    | 2015-11-25T20:20:32.000830 |    0 |
> +-------+----------------------------+------+
>
> Read the detailed documentation on: [2]
>
> Use it, create a web page if you want to make it public! File a ticket
> if it gets too high! File a ticket if you need more info (a record per
> wiki?). But I wanted to give you the essentials, and you can build on top
> of that yourselves.
>
> Only 2 known bugs:
> - There is microsecond accuracy, but it cannot be used until a bug in
> MariaDB is fixed [3]
> - Due to some existing filters, enwiki will only report s1 lag until that
> server is restarted. We will schedule that at some time in the future.
>
> [0]<http://tools.wmflabs.org/betacommand-dev/cgi-bin/replag>
> [1]<https://www.percona.com/doc/percona-toolkit/2.2/pt-heartbeat.html>
> [2]<https://wikitech.wikimedia.org/wiki/Help:Tool_Labs/Database#Identifying_lag>
> [3]<https://mariadb.atlassian.net/browse/MDEV-9175>
> --
> Jaime Crespo
> <http://wikimedia.org>
>
>
> _______________________________________________
> Labs-l mailing list
> Labs-l at lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/labs-l
>
> --
> Jaime Crespo
> <http://wikimedia.org>


-- 
Jaime Crespo
<http://wikimedia.org>