Hi,
just a quick heads up that the replication lag on analytics-store.eqiad.wmnet (aka “The one machine to rule them all”) has risen to >12 hours for s1 replicas. Other replicas are fine.
So on analytics-store.eqiad.wmnet:
- enwiki is affected.
- log (EventLogging) is affected.
Other databases (like dewiki, eswiki, ...) on analytics-store.eqiad.wmnet are /not/ affected.
For queries that rely only on enwiki or log, you can use
s1-analytics-slave.eqiad.wmnet
as a drop-in replacement. enwiki and log are not lagging there.
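A minimal sketch of the host choice described above. Only the two host names and the affected databases come from this thread; the helper `pick_host` and its constants are hypothetical names used for illustration.

```python
# Hypothetical helper: choose a replica host from the databases a query reads.
# Host names are from the thread; everything else is an illustrative assumption.
DEFAULT_HOST = "analytics-store.eqiad.wmnet"
FALLBACK_HOST = "s1-analytics-slave.eqiad.wmnet"
LAGGING_ON_DEFAULT = {"enwiki", "log"}


def pick_host(databases):
    """Return the fallback host only when the query relies solely on the
    lagging databases; otherwise stay on the default host."""
    needed = set(databases)
    if not needed:
        raise ValueError("at least one database is required")
    if needed <= LAGGING_ON_DEFAULT:
        return FALLBACK_HOST
    # Mixed queries stay on the default host and will see stale enwiki/log data.
    return DEFAULT_HOST
```

For example, `pick_host(["enwiki", "log"])` picks the fallback host, while `pick_host(["dewiki"])` stays on the default one.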
I filed RT ticket 8032: https://rt.wikimedia.org/Ticket/Display.html?id=8032
Best regards, Christian
This might be me. Killing the query I'm worried about. I'll report back.
On Tue, Jul 29, 2014 at 5:46 PM, Christian Aistleitner <christian@quelltextlich.at> wrote:
Hi,
just a quick heads up that the replication lag on analytics-store.eqiad.wmnet (aka “The one machine to rule them all”) has risen to >12 hours for s1 replicas. Other replicas are fine.
So on analytics-store.eqiad.wmnet:
- enwiki is affected.
- log (EventLogging) is affected.
Other databases (like dewiki, eswiki, ...) on analytics-store.eqiad.wmnet are /not/ affected.
For queries that rely only on enwiki or log, you can use
s1-analytics-slave.eqiad.wmnet
as a drop-in replacement. enwiki and log are not lagging there.
I filed RT ticket 8032: https://rt.wikimedia.org/Ticket/Display.html?id=8032
Best regards, Christian
--
---- quelltextlich e.U. ---- \ ---- Christian Aistleitner ----
Companies' registry: 360296y in Linz
Christian Aistleitner
Kefermarkterstrasze 6a/3, 4293 Gutau, Austria
Email: christian@quelltextlich.at
Phone: +43 7946 / 20 5 81
Fax: +43 7946 / 20 5 81
Homepage: http://quelltextlich.at/

Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics
Looks like I was right.
I had a query writing to a table. I thought it would finish last night, but it ran for > 24 hours. I've killed it and changed the query so that it will write to an output file instead. I restarted the query, but now lag seems to be recovering. Lag is ~14 hours for enwiki and ~ 18 hours for log.
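A rough sketch of the "write to an output file instead" approach. This is not the actual query or tooling used here. It uses Python's built-in `sqlite3` module as a stand-in for the real database connection, and the function name `dump_query_to_tsv` is made up for illustration.

```python
import csv
import sqlite3  # stand-in for the real database driver; an assumption


def dump_query_to_tsv(conn, sql, path, batch_size=1000):
    """Stream the rows of a read-only query into a tab-separated file.

    The result set goes to a local file in batches, so no table is
    written on the database side.
    """
    cursor = conn.execute(sql)
    rows_written = 0
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t")
        # Header row taken from the column names of the result set.
        writer.writerow([column[0] for column in cursor.description])
        while True:
            batch = cursor.fetchmany(batch_size)
            if not batch:
                break
            writer.writerows(batch)
            rows_written += len(batch)
    return rows_written
```

Fetching in batches keeps memory use bounded even when the result set is large.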
-Aaron
On Tue, Jul 29, 2014 at 5:56 PM, Aaron Halfaker <ahalfaker@wikimedia.org> wrote:
This might be me. Killing the query I'm worried about. I'll report back.
Lag is down to ~ 16 hours for log and ~ 6 hours for enwiki. I'm declaring victory. Should be good by morning.
On Tue, Jul 29, 2014 at 6:04 PM, Aaron Halfaker <ahalfaker@wikimedia.org> wrote:
Looks like I was right.
I had a query writing to a table. I thought it would finish last night, but it ran for > 24 hours. I've killed it and changed the query so that it will write to an output file instead. I restarted the query, but now lag seems to be recovering. Lag is ~14 hours for enwiki and ~ 18 hours for log.
-Aaron
Hi Aaron,
On Tue, Jul 29, 2014 at 08:11:22PM -0500, Aaron Halfaker wrote:
Lag is down to ~ 16 hours for log and ~ 6 hours for enwiki. I'm declaring victory. Should be good by morning.
Thanks! Lag on enwiki is gone. \o/
Lag on log is currently still ~ 16 hours. I'll keep an eye on it.
Have fun, Christian
On Wed, Jul 30, 2014 at 4:40 PM, Christian Aistleitner <christian@quelltextlich.at> wrote:
Lag on log is currently still ~ 16 hours. I'll keep an eye on it.
https://bugzilla.wikimedia.org/show_bug.cgi?id=67450
EL could be much less susceptible to lag, and recover faster when it occurs.
Hi Sean,
On Thu, Jul 31, 2014 at 12:19:33PM +1000, Sean Pringle wrote:
On Wed, Jul 30, 2014 at 4:40 PM, Christian Aistleitner <christian@quelltextlich.at> wrote:
Lag on log is currently still ~ 16 hours. I'll keep an eye on it.
Full ACK.
I was really glad when I saw the bug getting added back then.
However, given what I heard about our investment in EventLogging, I doubt that it'll get prioritized soon :-(
EL could be much less susceptible to lag, and recover faster when it occurs.
Full ACK.
Although the final problematic query finished yesterday, with the current rate of replication lag recovery, it'll take us until tomorrow to have fully recovered :-(
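The kind of estimate behind "until tomorrow" can be sketched as a linear extrapolation from two lag readings. The function name and the sample numbers below are hypothetical, not measurements from this thread, and real recovery is unlikely to be perfectly linear.

```python
def hours_until_caught_up(lag_then, lag_now, hours_between):
    """Linearly extrapolate when replication lag reaches zero.

    All arguments are in hours. Returns None if the lag is not shrinking.
    """
    if hours_between <= 0:
        raise ValueError("hours_between must be positive")
    # Hours of lag recovered per hour of wall-clock time.
    recovery_rate = (lag_then - lag_now) / hours_between
    if recovery_rate <= 0:
        return None
    return lag_now / recovery_rate


# Hypothetical readings: lag fell from 16 h to 12 h over 8 h of wall-clock time.
print(hours_until_caught_up(16.0, 12.0, 8.0))  # prints 24.0
```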
Have fun, Christian
On Thu, Jul 31, 2014 at 9:05 PM, Christian Aistleitner <christian@quelltextlich.at> wrote:
Although the final problematic query finished yesterday, with the current rate of replication lag recovery, it'll take us until tomorrow to have fully recovered :-(
That slow recovery is not entirely EL's fault. The release team is running a couple of jobs populating new fields in production *links tables on commons and wikidata. A challenging week for slave disk I/O :-)
I just added the bug to the Scrumbugs backlog.
Christian, you're right about it not getting prioritized soon. We'll go through the backlog again in September while doing some release planning for the next quarter.
Hi,
On Wed, Jul 30, 2014 at 12:46:25AM +0200, Christian Aistleitner wrote:
So on analytics-store.eqiad.wmnet:
- enwiki is affected.
- log (EventLogging) is affected.
Both databases are back to normal.
Have fun, Christian