Hi Sean,
Here's Marcel from Analytics.
I'd like to discuss with you some strange behaviors we've observed on the EventLogging database (m4-master.eqiad.wmnet).
1) There are some time spans where there is no data in any table. Examples follow:
- 2015-04-09 17:20 -> 18:35
- 2015-04-11 03:35 -> 05:20
- 2015-04-11 14:00 -> 15:20
- 2015-04-11 19:00 -> 20:00
- 2015-04-12 14:30 -> 15:40
- 2015-04-13 11:35 -> 12:30
- 2015-04-13 16:30 -> 17:35
This is happening a lot, as you can see, so we are really concerned about it. These outages are not explained by any of the EventLogging logs (processor logs, consumer logs), which indicate that the events were sent to the database normally (and written) during those time spans.
2) This may or may not be related, but we observed that sometimes the last events inserted in the tables date from 30 minutes to 1.5 hours in the past. At other moments, the last inserts date from seconds ago. This may be expected due to slave replication, but we're not sure.
Do you have any idea of what might be happening? Any thoughts would help a lot.
Thanks!
Marcel
Hi Marcel
On Wed, Apr 15, 2015 at 2:23 AM, Marcel Ruiz Forns mforns@wikimedia.org wrote:
- There are some time spans where there is no data in any table. Examples follow:
2015-04-09 17:20 -> 18:35
2015-04-11 03:35 -> 05:20
2015-04-11 14:00 -> 15:20
2015-04-11 19:00 -> 20:00
2015-04-12 14:30 -> 15:40
2015-04-13 11:35 -> 12:30
2015-04-13 16:30 -> 17:35
This is happening a lot, as you can see, so we are really concerned about it. These outages are not explained by any of the EventLogging logs (processor logs, consumer logs), which indicate that the events were sent to the database normally (and written) during those time spans.
We have a binary log on the EL master that holds the last week of INSERT statements. It can be dumped and grepped, eg looking at 10-minute blocks around 2015-04-13 16:30:
root@db1046:# /opt/wmf-mariadb10/bin/mysqlbinlog db1046-bin.000* | grep INSERT | grep '20150413161' | wc -l
26849
root@db1046:# /opt/wmf-mariadb10/bin/mysqlbinlog db1046-bin.000* | grep INSERT | grep '20150413163' | wc -l
0
Zero(!) during the 10 minutes after 16:30 doesn't look good. This means the database master did not see any INSERTs with 20150413163* timestamps within the last week. Can you describe how you know that events were written normally? Simply a lack of errors from the mysql consumer?
Can you supply some specific records from the EL logs with timestamps that should definitely be in the database, so we can scan the database binlog for specific UUIDs or suchlike?
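As an illustration of that binlog scan, a small script along these lines could check a dumped binlog for a list of UUIDs (the binlog filename and UUIDs below are placeholders, not real event IDs):

import subprocess

# UUIDs we expect to find in the master binlog (placeholders).
expected_uuids = [
    "0123456789abcdef0123456789abcdef",
    "fedcba9876543210fedcba9876543210",
]

# Dump one binlog file to text; for very large binlogs, stream the
# output line by line instead of holding it all in memory.
dump = subprocess.run(
    ["/opt/wmf-mariadb10/bin/mysqlbinlog", "db1046-bin.000123"],
    stdout=subprocess.PIPE, text=True, check=True,
).stdout

for uuid in expected_uuids:
    print(uuid, "FOUND" if uuid in dump else "MISSING")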
- This may or may not be related, but we observed that sometimes the last events inserted in the tables date from 30 minutes to 1.5 hours in the past. At other moments, the last inserts date from seconds ago. This may be expected due to slave replication, but we're not sure.
Simple answer is: yes, lag is quite possible given the high write rate of EL, and it can also be affected by load from the other replication streams of the production wikis (cumulative, and affected by time of day, spikes, jobrunners, etc). It's possible we could tune it better, though -- can you give me some idea of how long your "at other moments" delay is?
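As an illustration of how that lag could be watched over time, a small script might poll Seconds_Behind_Master on the slave (the host and credentials here are placeholders, not the real setup):

import time
import pymysql

# Connect to the slave (placeholder host/credentials).
conn = pymysql.connect(host="analytics-store.eqiad.wmnet",
                       user="watcher", password="...",
                       cursorclass=pymysql.cursors.DictCursor)

while True:
    with conn.cursor() as cur:
        cur.execute("SHOW SLAVE STATUS")
        row = cur.fetchone()
        # Seconds_Behind_Master is NULL when replication is broken.
        print(time.strftime("%H:%M:%S"),
              "lag:", row["Seconds_Behind_Master"] if row else "no slave status")
    time.sleep(60)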
BR
Sean
Sean, thanks for the quick response:
We have a binary log on the EL master that holds the last week of INSERT statements. It can be dumped and grepped, eg looking at 10-minute blocks around 2015-04-13 16:30:
Good to know!
Zero(!) during the 10 minutes after 16:30 doesn't look good. This means the database master did not see any INSERTs with 20150413163* timestamps within the last week.
Ok, makes more sense.
Can you describe how you know that events were written normally? Simply a lack of errors from the mysql consumer?
The MySQL consumer log not only lacks errors, but also has records of successful writes to the db at the time of the problems. Also, the processor logs indicate homogeneous throughput of valid events at all times.
*However*, the consumer logs record only the insert timestamp, not the event timestamp (which goes to the table). So it could be that there's some data loss inside the consumer code (or in zmq?) that wouldn't stop the write flow, but would skip a segment of events. I'll look deeper into this.
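For example, the consumer's insert logging could be extended to record both the wall-clock insert time and the range of event timestamps per batch; a minimal sketch, with illustrative names rather than the actual EL consumer code:

import logging
from datetime import datetime, timezone

log = logging.getLogger("mysql-consumer")

def log_batch_insert(events):
    # Log both the insert time and the range of event timestamps, so a
    # gap in event time stays visible even when inserts never fail.
    insert_time = datetime.now(timezone.utc).strftime("%Y%m%d%H%M%S")
    if events:
        first = min(e["timestamp"] for e in events)
        last = max(e["timestamp"] for e in events)
        log.info("inserted %d events at %s (event ts %s..%s)",
                 len(events), insert_time, first, last)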
Can you supply some specific records from the EL logs with timestamps that should definitely be in the database, so we can scan the database binlog for specific UUIDs or suchlike?
I'll try to get those.
-- can you give me some idea of how long your "at other moments" delay is?
I'll observe the db during the day and give you an estimate.
Thanks!
Hi Sean,
*However*, the consumer logs record only the insert timestamp, not the event timestamp (which goes to the table). So it could be that there's some data loss inside the consumer code (or in zmq?) that wouldn't stop the write flow, but would skip a segment of events. I'll look deeper into this.
We've deployed an improvement to the EL mysql consumer logs, to be sure that the events being inserted at the time of the DB gaps indeed corresponded to the missing data. And we've seen that the answer is yes: the consumer executes the missing inserts in time and without errors from the sqlalchemy client.
Can you supply some specific records from the EL logs with timestamps that should definitely be in the database, so we can scan the database binlog for specific UUIDs or suchlike?
Here are three valid events that were apparently inserted correctly, but don't appear in the db. http://pastebin.com/8wm6qkkE (they are performance events and contain no sensitive data)
-- can you give me some idea of how long your "at other moments" delay is?
I followed the master-slave replication lag for some hours and noticed a pattern: the lag gets progressively bigger over time, increasing by roughly 10 minutes per hour, until it reaches lags of 1 to 2 hours. At that point, the data gap happens and the replication lag drops back to a few minutes. I only managed to catch a data gap "live" twice, so that's definitely not a conclusive statement. But there's this hypothesis that the two problems are related.
Sean, I hope that helps answer your questions. Let us know if you have any ideas on this.
Thank you!
Marcel
This sounds like the fixes we did last quarter to the batch insertion basically hid the problem instead of making it go away.
This sounds like the fixes we did last quarter to the batch insertion basically hid the problem instead of making it go away.

I think we are mixing things here: when we had issues with the batching code, we never saw a pattern of "no events whatsoever in any table" for an hour. We saw events dropped in bursts here and there, but certainly not an hour-long blackout.
Also, there were no events dropped when we did the major backfilling in early March, when the db sustained quite a bit of load as we had to insert events one by one.
So (while I am not saying we could not uncover a code issue on our end) we have not seen this particular error pattern before.
I didn't mean to suggest we saw this error before. I was trying to say that intuitively the error seems very similar. That is, over time, the lag grows, and at some point it's so big that we lose a bunch of data all at once. I said that because the first place I'd look is at that change. For example, I'd try to reproduce it by simulating the same traffic, and then I'd revert to the original logic from before batch inserts and try that. We've all looked at this code, but we must be missing something big.
Given that the batching code has been deployed since March 16th, earlier than the first event gap Marcel listed (April 9th), and that we have since swapped the EL box (April 3rd/4th), we probably want to look at system issues.
In my opinion, it is probably easier to see with tcpdump whether the inserts are being sent, rather than trying to replay traffic to reproduce the problem.
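A minimal sketch of that tcpdump approach, counting captured packets on the MySQL port that contain an INSERT (needs root; the interface and port are assumptions about the setup):

import subprocess

# -l line-buffers the output, -A prints packet payloads as ASCII.
proc = subprocess.Popen(
    ["tcpdump", "-l", "-A", "-i", "any", "tcp port 3306"],
    stdout=subprocess.PIPE, text=True,
)
inserts = 0
for line in proc.stdout:
    if "INSERT" in line:
        inserts += 1
        print("INSERTs seen so far:", inserts)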
Thanks,
Nuria
Fair enough, just a thought.
On Thu, Apr 16, 2015 at 7:58 AM, Marcel Ruiz Forns mforns@wikimedia.org wrote:
Here are three valid events that were apparently inserted correctly, but don't appear in the db. http://pastebin.com/8wm6qkkE (they are performance events and contain no sensitive data)
Thanks. If possible, can we have:
- The exact INSERT statements issued by the MySQL consumer
- The UUID values generated for those records
I followed the master-slave replication lag for some hours and noticed a pattern: the lag gets progressively bigger over time, increasing by roughly 10 minutes per hour, until it reaches lags of 1 to 2 hours. At that point, the data gap happens and the replication lag drops back to a few minutes. I only managed to catch a data gap "live" twice, so that's definitely not a conclusive statement. But there's this hypothesis that the two problems are related.
Just for clarity, may I ask how you are testing this?
-s
Thanks. If possible, can we have:
- The exact INSERT statements issued by the MySQL consumer
- The UUID values generated for those records
I'll try to get them, sure.
I followed the master-slave replication lag for some hours and noticed a pattern: the lag gets progressively bigger over time, increasing by roughly 10 minutes per hour, until it reaches lags of 1 to 2 hours. At that point, the data gap happens and the replication lag drops back to a few minutes. I only managed to catch a data gap "live" twice, so that's definitely not a conclusive statement. But there's this hypothesis that the two problems are related.
Just for clarity, may I ask how you are testing this?
1) To identify the data gaps I used:
select left(timestamp, 11), count(*)
from Edit_11448630
where timestamp >= '20150415000000' and timestamp < '20150416000000'
group by 1;
Note that the table name and the timestamps can be adapted as necessary. This query returns something like:
+---------------------+----------+
| left(timestamp, 11) | count(*) |
+---------------------+----------+
| 20150415000         |     9823 |
| 20150415001         |    10158 |
| 20150415002         |     9473 |
| 20150415003         |     9493 |
| 20150415004         |     9297 |
| 20150415005         |     9390 |
| 20150415010         |     9849 |
| 20150415011         |     9619 |
| 20150415012         |    10038 |
| 20150415013         |     9763 |
| 20150415014         |     9750 |
| 20150415015         |     9633 |
| ...                 |      ... |
+---------------------+----------+
This lists the number of events for each 10-minute slot. When there's a data gap, the result of the query looks like this:
+---------------------+----------+
| left(timestamp, 11) | count(*) |
+---------------------+----------+
| ...                 |      ... |
| 20150415150         |    21237 |
| 20150415151         |    20677 |
| 20150415152         |    20541 |
| 20150415153         |    19671 |
| 20150415154         |    19623 |
| 20150415155         |    19281 |
| 20150415160         |    19243 |
| 20150415161         |     5708 |  <= Gap: the 16:20 and 16:30 slots have no data!
| 20150415164         |    11590 |
| 20150415165         |    18745 |
| ...                 |      ... |
+---------------------+----------+
2) To get the master-slave replication lag I used:
select timestamp from Edit_11448630 order by 1 desc limit 1;
Again, the table name can be substituted. This gives me, supposedly, the timestamp of the last inserted event. Comparing that with the current time, I get the lag.
3) To correlate both: I just happened to be monitoring the progressively increasing replication lag, and after noticing an abrupt recovery of the latter, I checked and found a data gap.
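The two checks above could be combined into one small script that flags empty 10-minute slots for a day and estimates the lag from the newest event timestamp (the table name, day, host, and credentials are placeholders):

from datetime import datetime
import pymysql

TABLE = "Edit_11448630"  # placeholder table name
DAY = "20150415"

conn = pymysql.connect(host="analytics-store.eqiad.wmnet",
                       user="research", password="...", db="log")

with conn.cursor() as cur:
    # 1) Events per 10-minute slot; slots that never appear are gaps.
    cur.execute(
        "SELECT LEFT(timestamp, 11), COUNT(*) FROM " + TABLE +
        " WHERE timestamp >= %s AND timestamp < %s GROUP BY 1",
        ("20150415000000", "20150416000000"))
    counts = dict(cur.fetchall())
    for hour in range(24):
        for tens in range(6):
            slot = "%s%02d%d" % (DAY, hour, tens)
            if counts.get(slot, 0) == 0:
                print("possible gap at slot", slot)

    # 2) Lag estimate: newest event timestamp vs. current UTC time.
    cur.execute("SELECT MAX(timestamp) FROM " + TABLE)
    last = datetime.strptime(cur.fetchone()[0], "%Y%m%d%H%M%S")
    print("approximate lag:", datetime.utcnow() - last)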
On Thu, Apr 16, 2015 at 7:58 AM, Marcel Ruiz Forns mforns@wikimedia.org wrote:
I followed the master-slave replication lag for some hours and noticed a pattern: the lag gets progressively bigger over time, increasing by roughly 10 minutes per hour, until it reaches lags of 1 to 2 hours. At that point, the data gap happens and the replication lag drops back to a few minutes. I only managed to catch a data gap "live" twice, so that's definitely not a conclusive statement. But there's this hypothesis that the two problems are related.
Today I've run some sync tests between the EL master and analytics-slave. So far I've not found any discrepancies -- the master and slave tables, when replication is caught up(!), have identical data. I infer that the data gaps you found do exist, but are not related to replication or replication lag, and are occurring somewhere upstream of analytics-store: either on the EL master (db1046) itself, or between the consumer and the master. I'll wait for the example UUIDs to dig further into the master binary logs.
Regarding the replication lag, a few observations:
- Asynchronous replication will always be susceptible to lag as long as the slave handles other traffic. The fixes done to have the consumer batch-insert records have greatly reduced the lag problem, so that we haven't seen 24-hour+ lag in months, but asynchronous replication does just what it says on the tin :-)
- An hour or two of lag observed infrequently is often due to some *other* activity on the slave. The way to track it down is to first look for patterns -- e.g., a certain time of day may indicate a poorly optimized cron job or suchlike. If you do catch replication lag of greater than 5 minutes in the act, view the DB processlist to see what other queries are executing (see the sketch below). Check if something is simply hammering the box, or if something is locking records or tables that are attempting to replicate, or ... [insert strange cause here].
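For instance, catching the other queries running during a lag spike could look like this minimal sketch (the host, credentials, and 5-minute threshold are placeholders):

import pymysql

conn = pymysql.connect(host="analytics-store.eqiad.wmnet",
                       user="watcher", password="...",
                       cursorclass=pymysql.cursors.DictCursor)

with conn.cursor() as cur:
    cur.execute("SHOW FULL PROCESSLIST")
    for p in cur.fetchall():
        # Flag anything that has been running for more than 5 minutes.
        if p["Time"] and p["Time"] > 300:
            print(p["Id"], p["User"], p["Time"], "s:", (p["Info"] or "")[:120])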
BR
Sean
Sean and list,
I think we found the problem:
The data loss is happening within the EL consumer code. The error was skillfully dodging the logs; sorry for that.
The root cause is that the db insertion takes too long to keep up with the rate of incoming events, so the events buffer grows. When it gets big enough, the program crashes and the buffered data is lost.
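To illustrate the failure mode (with illustrative names, not the actual EL consumer code): an unbounded in-memory buffer grows whenever inserts are slower than arrivals, and everything still buffered is lost when the process dies. A bounded queue at least pushes back on the producer instead of growing until a crash:

import queue

# Bounded queue: when full, the producer blocks instead of letting the
# in-memory buffer (and eventually the process) blow up.
events = queue.Queue(maxsize=10000)

BATCH_SIZE = 400

def writer(insert_batch):
    # Drain events in batches; anything still queued when the process
    # dies is lost, which is why the bound (and monitoring) matters.
    batch = []
    while True:
        batch.append(events.get())
        if len(batch) >= BATCH_SIZE:
            insert_batch(batch)  # e.g. a sqlalchemy bulk insert
            batch = []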
Thanks Sean for your comments, they helped a lot! I will update the phab task with a detailed explanation and next steps: https://phabricator.wikimedia.org/T96082
Cheers,
Marcel