On Thu, Apr 16, 2015 at 7:58 AM, Marcel Ruiz Forns mforns@wikimedia.org wrote:
I followed the master-slave replication lag for some hours and noticed a pattern: it grows progressively over time, by roughly 10 minutes per hour, until it reaches 1 to 2 hours. At that point the data gap happens and the replication lag drops back to a few minutes. I could only catch a data gap "live" 2 times, so that's definitely not a conclusive statement. But there's this hypothesis that the two problems are related.
Today I've run some sync tests between EL master and analytics-slave. So far I've not found any discrepancies -- the master and slave tables, when replication is caught up(!), have identical data. I infer that the data gaps you found do exist but are not related to replication or replication lag, and are occurring somewhere upstream of analytics-store, either on the EL master (db1046) itself or between the master and the consumer. I'll wait to see the example UUIDs to dig further in the master binary logs.
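For anyone wanting to repeat a check like this by hand, here is a minimal sketch in Python of the comparison step only. It is not the tool used for the tests above. It assumes you have already pulled (uuid, checksum) pairs for the same time window from both hosts while replication is caught up; the function name and the pair format are hypothetical.

```python
def diff_by_uuid(master_rows, slave_rows):
    """Compare two iterables of (uuid, checksum) pairs.

    Returns three sorted lists: UUIDs missing on the slave, UUIDs present
    only on the slave, and UUIDs present on both with differing checksums.
    The (uuid, checksum) shape is an assumption made for this sketch.
    """
    master = dict(master_rows)
    slave = dict(slave_rows)

    missing_on_slave = sorted(set(master) - set(slave))
    extra_on_slave = sorted(set(slave) - set(master))
    mismatched = sorted(
        uuid for uuid in master.keys() & slave.keys()
        if master[uuid] != slave[uuid]
    )
    return missing_on_slave, extra_on_slave, mismatched
```

Three empty lists for a given window would be consistent with the gaps originating upstream of replication rather than in it.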
Regarding the replication lag, a few observations:
- Asynchronous replication will always be susceptible to lag as long as the slave handles other traffic. The fixes done to have the consumer batch-insert records have greatly reduced the lag problem so that we haven't seen 24-hour+ lag in months, but asynchronous replication does just what it says on the tin :-)
- An hour or two of lag observed infrequently is often due to some *other* activity on the slave. The way to track it down is to first look for patterns -- e.g., a certain time of day may indicate a poorly optimized cron job or suchlike. If you do catch replication lag of greater than 5 minutes in the act, view the DB processlist to see what other queries are executing. Check if something is simply hammering the box, or if something is locking records or tables that are attempting to replicate, or ... [insert strange cause here].
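To make the "look for patterns" step concrete, here is a rough Python sketch that buckets lag samples by hour of day. It assumes you have been recording (timestamp, lag in seconds) pairs yourself, for example from the Seconds_Behind_Master column of SHOW SLAVE STATUS, with None where the value was NULL. The function name and the 5-minute default are illustrative, not an existing tool.

```python
from collections import Counter
from datetime import datetime

LAG_THRESHOLD_SECONDS = 5 * 60  # the "greater than 5 minutes" rule of thumb


def lagging_hours(samples, threshold=LAG_THRESHOLD_SECONDS):
    """Count over-threshold lag samples per hour of day.

    `samples` is an iterable of (datetime, lag_seconds) pairs; lag_seconds
    may be None when the lag was unknown at sampling time. Returns a list
    of (hour, count) pairs, most frequent hour first.
    """
    counts = Counter()
    for timestamp, lag_seconds in samples:
        if lag_seconds is not None and lag_seconds > threshold:
            counts[timestamp.hour] += 1
    return counts.most_common()


if __name__ == "__main__":
    demo = [
        (datetime(2015, 4, 15, 3, 10), 1800),
        (datetime(2015, 4, 15, 14, 0), 40),
        (datetime(2015, 4, 16, 3, 20), 4200),
    ]
    print(lagging_hours(demo))
```

If one or two hours dominate the result, that is the window in which to look at cron schedules and capture the processlist.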
BR Sean