Hi Marcel
On Wed, Apr 15, 2015 at 2:23 AM, Marcel Ruiz Forns <mforns(a)wikimedia.org> wrote:
Hi Sean,
Here's Marcel from Analytics.
I'd like to comment with you some strange behaviors that we've observed on
EventLogging database (m4-master.equiad.wmnet).
1) There are some time spans where there is no data in any table. Examples
follow:
2015-04-09 17:20 -> 18:35
2015-04-11 03:35 -> 05:20
2015-04-11 14:00 -> 15:20
2015-04-11 19:00 -> 20:00
2015-04-12 14:30 -> 15:40
2015-04-13 11:35 -> 12:30
2015-04-13 16:30 -> 17:35
This is happening a lot, as you can see, so we are really concerned about
it.
These outages are not explained by any of the EventLogging logs (processor
logs, consumer logs) which confirm that the events were actually sent
normally to the database (and written) for those time spans.
We have a binary log on the EL master that holds the last week of
INSERT statements. It can be dumped and grepped, eg looking at
10-minute blocks around 2015-04-13 16:30:
root@db1046:# /opt/wmf-mariadb10/bin/mysqlbinlog db1046-bin.000* |
grep INSERT | grep '20150413161' | wc -l
26849
root@db1046:# /opt/wmf-mariadb10/bin/mysqlbinlog db1046-bin.000* |
grep INSERT | grep '20150413163' | wc -l
0
Zero(!) during 10min after 16:30 doesn't look good. This means the
database master did not see any INSERTs with 20150413163* timestamps
within the last week. Can you describe how you know that events were
written normally? Simply a lack of errors from mysql consumer?
Can you supply some specific records from the EL logs with timestamps
that should definitely be in the database, so we can scan the database
binlog for specific UUIDs or suchlike?
2) Maybe it has something to do with it, or not. But we observed that
sometimes the last events inserted in the tables date from 30mins - 1h30mins
in the past. At other moments, the last inserts date from seconds ago. This
is maybe expected due to slave replication but we're not sure.
Simple answer is: Yes, lag is quite possible given the high write rate
of EL, and can also be affected by load from other replication streams
of the production wikis (cumulative and affected by time of day,
spikes, jobrunners, etc). It's possible we could tune it better though
-- can you give me some idea of how long your "at other moments" delay
is?
BR
Sean