Hi Marcel
On Wed, Apr 15, 2015 at 2:23 AM, Marcel Ruiz Forns mforns@wikimedia.org wrote:
Hi Sean,
Here's Marcel from Analytics.
I'd like to comment with you some strange behaviors that we've observed on EventLogging database (m4-master.equiad.wmnet).
- There are some time spans where there is no data in any table. Examples
follow:
2015-04-09 17:20 -> 18:35 2015-04-11 03:35 -> 05:20 2015-04-11 14:00 -> 15:20 2015-04-11 19:00 -> 20:00 2015-04-12 14:30 -> 15:40 2015-04-13 11:35 -> 12:30 2015-04-13 16:30 -> 17:35
This is happening a lot, as you can see, so we are really concerned about it. These outages are not explained by any of the EventLogging logs (processor logs, consumer logs) which confirm that the events were actually sent normally to the database (and written) for those time spans.
We have a binary log on the EL master that holds the last week of INSERT statements. It can be dumped and grepped, eg looking at 10-minute blocks around 2015-04-13 16:30:
root@db1046:# /opt/wmf-mariadb10/bin/mysqlbinlog db1046-bin.000* | grep INSERT | grep '20150413161' | wc -l 26849
root@db1046:# /opt/wmf-mariadb10/bin/mysqlbinlog db1046-bin.000* | grep INSERT | grep '20150413163' | wc -l 0
Zero(!) during 10min after 16:30 doesn't look good. This means the database master did not see any INSERTs with 20150413163* timestamps within the last week. Can you describe how you know that events were written normally? Simply a lack of errors from mysql consumer?
Can you supply some specific records from the EL logs with timestamps that should definitely be in the database, so we can scan the database binlog for specific UUIDs or suchlike?
- Maybe it has something to do with it, or not. But we observed that
sometimes the last events inserted in the tables date from 30mins - 1h30mins in the past. At other moments, the last inserts date from seconds ago. This is maybe expected due to slave replication but we're not sure.
Simple answer is: Yes, lag is quite possible given the high write rate of EL, and can also be affected by load from other replication streams of the production wikis (cumulative and affected by time of day, spikes, jobrunners, etc). It's possible we could tune it better though -- can you give me some idea of how long your "at other moments" delay is?
BR Sean