>This sounds like the fixes we did last quarter to the batch insertion basically hid the problem instead of making it go away.
I think we are mixing things here, when we had issues with batching code we never saw a pattern of "no-events-whatsoever-in-any-table for an hour". We saw events dropped in bursts here and there but certainly not  an "hour long blackout".

Also, there were no events dropped when we did the major backfilling in early march where the db sustained quite a bit of load as we had to insert those one by one.

So (while I am not saying we could not uncover a code issue in our end) we have not seen this particular error pattern before. 









On Wed, Apr 15, 2015 at 3:00 PM, Dan Andreescu <dandreescu@wikimedia.org> wrote:
This sounds like the fixes we did last quarter to the batch insertion basically hid the problem instead of making it go away.

On Wed, Apr 15, 2015 at 5:58 PM, Marcel Ruiz Forns <mforns@wikimedia.org> wrote:
Hi Sean,
 
*However*, the consumer logs indicate the insert timestamp only, not the event timestamp (which goes to the table). So it could be that there's some data loss inside the consumer code (or in zmq?) that wouldn't stop the write flow, but would skip a segment of events. I'll look deeper into this.

We've deployed an improvement on the EL mysql consumer logs, to be sure that the events that were being inserted at the time of the DB gaps corresponded indeed with the missing data. And we've seen that the response is yes, the consumer executes the missing inserts in time and without errors in the sqlalchemy client.

Can you supply some specific records from the EL logs with timestamps
that should definitely be in the database, so we can scan the database
binlog for specific UUIDs or suchlike?

Here are three valid events that were apparently inserted correctly, but don't appear in the db.
(they are performance events and contain no sensitive data)

-- can you give me some idea of how long your "at other moments" delay
is?

I followed the master-slave replication lag for some hours, and perceived a pattern in the lag: It gets progressively bigger with time, more or less with a 10 minute increase per hour, reaching lags of 1 to 2 hours. At that point, the data gap happens and the replication lag goes back to few minutes lag. I could only catch a data gap "live" 2 times, so that's definitely not a conclusive statement. But, there's this hypothesis that the two problems are related.

Sean, I hope that helps answering your questions.
Let us know if you have any idea on this.

Thank you!

Marcel

On Tue, Apr 14, 2015 at 9:15 PM, Marcel Ruiz Forns <mforns@wikimedia.org> wrote:
Sean, thanks for the quick response:
 
We have a binary log on the EL master that holds the last week of
INSERT statements. It can be dumped and grepped, eg looking at
10-minute blocks around 2015-04-13 16:30:

Good to know!

Zero(!) during 10min after 16:30 doesn't look good. This means the
database master did not see any INSERTs with 20150413163* timestamps
within the last week.

Ok, makes more sense. 
 
Can you describe how you know that events were
written normally? Simply a lack of errors from mysql consumer?

MySQL consumer log not only lacks errors, but has records of successful writes to the db at the time of the problems. Also, the processor logs indicate homogeneous throughput of valid events during all times.

*However*, the consumer logs indicate the insert timestamp only, not the event timestamp (which goes to the table). So it could be that there's some data loss inside the consumer code (or in zmq?) that wouldn't stop the write flow, but would skip a segment of events. I'll look deeper into this.
 
Can you supply some specific records from the EL logs with timestamps
that should definitely be in the database, so we can scan the database
binlog for specific UUIDs or suchlike?

I'll try to get those. 

-- can you give me some idea of how long your "at other moments" delay
is?

I'll observe the db during the day and give you an estimate.

Thanks!



_______________________________________________
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics



_______________________________________________
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics