Hi Sean,
We've deployed an improvement to the EL mysql consumer logs, to verify that the events being inserted at the time of the DB gaps did indeed correspond to the missing data. The answer is yes: the consumer executes the missing inserts on time and without errors from the sqlalchemy client.
*However*, the consumer logs record only the insert timestamp, not the event timestamp (which is what goes into the table). So it could be that there's some data loss inside the consumer code (or in zmq?) that wouldn't stop the write flow but would skip a segment of events. I'll look deeper into this.
> Can you supply some specific records from the EL logs with timestamps
> that should definitely be in the database, so we can scan the database
> binlog for specific UUIDs or suchlike?
Here are three valid events that were apparently inserted correctly but don't appear in the DB
(they are performance events and contain no sensitive data).
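Once the binlog has been decoded to text (e.g. with `mysqlbinlog --verbose`), the scan Sean suggests could be as simple as the sketch below. The helper name and the decoding step are my own assumptions, not something we run today:

```python
def find_missing_uuids(binlog_text, uuids):
    """Return the subset of `uuids` that never appear in the decoded
    binlog text -- i.e. events whose inserts never reached the binlog.
    `binlog_text` is the output of `mysqlbinlog --verbose` as one string."""
    return sorted(u for u in uuids if u not in binlog_text)
```

For example, feeding it the decoded binlog and the three event UUIDs above would tell us immediately whether the inserts made it as far as the binlog or were lost before the master wrote them.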
> Can you give me some idea of how long your "at other moments" delay
> is?
I followed the master-slave replication lag for some hours and noticed a pattern: the lag grows progressively over time, roughly 10 minutes per hour, until it reaches 1 to 2 hours. At that point, the data gap happens and the replication lag drops back to a few minutes. I only managed to catch a data gap "live" twice, so that's definitely not a conclusive statement, but it supports the hypothesis that the two problems are related.
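To catch the next gap "live" automatically, something like the sketch below could flag the lag-reset moments. The sampling source (e.g. periodic `SHOW SLAVE STATUS` polls of `Seconds_Behind_Master`) and the drop threshold are assumptions, not our actual tooling:

```python
def lag_resets(samples, drop_factor=10):
    """samples: list of (timestamp, lag_seconds) pairs in time order,
    e.g. from polling Seconds_Behind_Master once a minute.
    Returns the timestamps where the lag collapsed by more than
    `drop_factor`x between consecutive samples -- candidate gap moments."""
    resets = []
    for (_, prev_lag), (ts, lag) in zip(samples, samples[1:]):
        if prev_lag > 0 and lag < prev_lag / drop_factor:
            resets.append(ts)
    return resets
```

Cross-referencing those timestamps with the consumer logs would show whether every lag reset coincides with a gap, or only some of them do.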
Sean, I hope that helps answer your questions.
Let us know if you have any ideas on this.
Thank you!
Marcel