This sounds like the fixes we did last quarter to the batch insertion basically hid the problem instead of making it go away.

On Wed, Apr 15, 2015 at 5:58 PM, Marcel Ruiz Forns <mforns@wikimedia.org> wrote:
Hi Sean,
 
*However*, the consumer logs indicate the insert timestamp only, not the event timestamp (which goes to the table). So it could be that there's some data loss inside the consumer code (or in zmq?) that wouldn't stop the write flow, but would skip a segment of events. I'll look deeper into this.

We've deployed an improvement on the EL mysql consumer logs, to be sure that the events that were being inserted at the time of the DB gaps corresponded indeed with the missing data. And we've seen that the response is yes, the consumer executes the missing inserts in time and without errors in the sqlalchemy client.

Can you supply some specific records from the EL logs with timestamps
that should definitely be in the database, so we can scan the database
binlog for specific UUIDs or suchlike?

Here are three valid events that were apparently inserted correctly, but don't appear in the db.
http://pastebin.com/8wm6qkkE
(they are performance events and contain no sensitive data)

-- can you give me some idea of how long your "at other moments" delay
is?

I followed the master-slave replication lag for some hours, and perceived a pattern in the lag: It gets progressively bigger with time, more or less with a 10 minute increase per hour, reaching lags of 1 to 2 hours. At that point, the data gap happens and the replication lag goes back to few minutes lag. I could only catch a data gap "live" 2 times, so that's definitely not a conclusive statement. But, there's this hypothesis that the two problems are related.

Sean, I hope that helps answering your questions.
Let us know if you have any idea on this.

Thank you!

Marcel

On Tue, Apr 14, 2015 at 9:15 PM, Marcel Ruiz Forns <mforns@wikimedia.org> wrote:
Sean, thanks for the quick response:
 
We have a binary log on the EL master that holds the last week of
INSERT statements. It can be dumped and grepped, eg looking at
10-minute blocks around 2015-04-13 16:30:

Good to know!

Zero(!) during 10min after 16:30 doesn't look good. This means the
database master did not see any INSERTs with 20150413163* timestamps
within the last week.

Ok, makes more sense. 
 
Can you describe how you know that events were
written normally? Simply a lack of errors from mysql consumer?

MySQL consumer log not only lacks errors, but has records of successful writes to the db at the time of the problems. Also, the processor logs indicate homogeneous throughput of valid events during all times.

*However*, the consumer logs indicate the insert timestamp only, not the event timestamp (which goes to the table). So it could be that there's some data loss inside the consumer code (or in zmq?) that wouldn't stop the write flow, but would skip a segment of events. I'll look deeper into this.
 
Can you supply some specific records from the EL logs with timestamps
that should definitely be in the database, so we can scan the database
binlog for specific UUIDs or suchlike?

I'll try to get those. 

-- can you give me some idea of how long your "at other moments" delay
is?

I'll observe the db during the day and give you an estimate.

Thanks!



_______________________________________________
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics