Hi,
one might be tempted to think that the pmtpa data center having its servers shut down today should not affect the analytics database slaves, as they come with “eqiad” in their name:
  s[1-7]-analytics-slave.eqiad.wmnet
But it seems that at least our s7 slave went down while the pmtpa machines were being taken down.
As that looked suspicious, I looked only at the first indirection of the slaves' hostnames, which unmasked all our slaves (except s1 and s5) as being just CNAMEs for pmtpa machines:
* s2-analytics-slave.eqiad.wmnet (db69.pmtpa.wmnet)
* s3-analytics-slave.eqiad.wmnet (db71.pmtpa.wmnet)
* s4-analytics-slave.eqiad.wmnet (db72.pmtpa.wmnet)
* s6-analytics-slave.eqiad.wmnet (db74.pmtpa.wmnet)
* s7-analytics-slave.eqiad.wmnet (db68.pmtpa.wmnet)
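
As a rough illustration of the check described above (not the commands actually used at the time), here is a minimal Python sketch that flags aliases whose name claims one data center while their first CNAME target sits in another. The alias/target pairs are copied from the list above; the function names and the assumption that hostnames follow a `<host>.<datacenter>.wmnet` scheme are mine. How the first indirection is obtained is left out on purpose; something like `dig +short <alias> CNAME` on a host that can see the internal DNS would be one way.

```python
# Alias -> first CNAME target, copied from the list in the message above.
FIRST_INDIRECTION = {
    "s2-analytics-slave.eqiad.wmnet": "db69.pmtpa.wmnet",
    "s3-analytics-slave.eqiad.wmnet": "db71.pmtpa.wmnet",
    "s4-analytics-slave.eqiad.wmnet": "db72.pmtpa.wmnet",
    "s6-analytics-slave.eqiad.wmnet": "db74.pmtpa.wmnet",
    "s7-analytics-slave.eqiad.wmnet": "db68.pmtpa.wmnet",
}


def data_center(hostname):
    """Return the data center label of a hostname.

    Assumes the <host>.<datacenter>.wmnet naming scheme (an assumption
    made for this sketch, inferred from the names in the list).
    """
    labels = hostname.rstrip(".").split(".")
    if len(labels) < 3 or labels[-1] != "wmnet":
        raise ValueError("unexpected hostname format: %r" % hostname)
    return labels[-2]


def mismatched_aliases(cnames):
    """Return sorted (alias, target) pairs whose data centers differ."""
    return sorted(
        (alias, target)
        for alias, target in cnames.items()
        if data_center(alias) != data_center(target)
    )


if __name__ == "__main__":
    for alias, target in mismatched_aliases(FIRST_INDIRECTION):
        print("%s (%s) is really in %s" % (alias, target, data_center(target)))
```

Run against the pairs above, every alias in the list is reported, since each "eqiad" name points at a pmtpa machine.
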
The s7 slave is already unreachable at the time of writing. I guess due to the pmtpa servers being taken offline. And I expect the remaining ones in the above list to go down soon as well.
I filed RT #7330 about it: https://rt.wikimedia.org/Ticket/Display.html?id=7330
Not clear when replacements will be available or how to proceed. Waiting for a response from Ops.
Best regards, Christian
Hi Christian --
Thanks for the heads-up. I've verbally notified Dario and the Research and Data team. They will follow up with tech-ops.
-Toby
On Mon, Apr 21, 2014 at 3:33 PM, Christian Aistleitner <christian@quelltextlich.at> wrote:
> Hi,
> one might be tempted to think that the pmtpa data center having its servers shut down today should not affect the analytics database slaves, as they come with "eqiad" in their name:
>   s[1-7]-analytics-slave.eqiad.wmnet
> But it seems that at least our s7 slave went down while the pmtpa machines were being taken down.
> As that looked suspicious, I looked only at the first indirection of the slaves' hostnames, which unmasked all our slaves (except s1 and s5) as being just CNAMEs for pmtpa machines:
> - s2-analytics-slave.eqiad.wmnet (db69.pmtpa.wmnet)
> - s3-analytics-slave.eqiad.wmnet (db71.pmtpa.wmnet)
> - s4-analytics-slave.eqiad.wmnet (db72.pmtpa.wmnet)
> - s6-analytics-slave.eqiad.wmnet (db74.pmtpa.wmnet)
> - s7-analytics-slave.eqiad.wmnet (db68.pmtpa.wmnet)
> The s7 slave is already unreachable at the time of writing. I guess due to the pmtpa servers being taken offline. And I expect the remaining ones in the above list to go down soon as well.
> I filed RT #7330 about it: https://rt.wikimedia.org/Ticket/Display.html?id=7330
> Not clear when replacements will be available or how to proceed. Waiting for a response from Ops.
> Best regards, Christian
> --
> quelltextlich e.U.
> Christian Aistleitner
> Companies' registry: 360296y in Linz
> Gruendbergstrasze 65a, 4040 Linz, Austria
> Email: christian@quelltextlich.at
> Phone: +43 732 / 26 95 63
> Fax: +43 732 / 26 95 63
> Homepage: http://quelltextlich.at/
Analytics mailing list Analytics@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/analytics
alright, that’s very unfortunate – thanks Christian for catching this. All these slaves are critical for a variety of scripts that populate dashboards and ad-hoc analysis outside of enwiki and dewiki.
I’ll immediately file an RT ticket.
Dario
On Apr 21, 2014, at 5:27 PM, Toby Negrin <tnegrin@wikimedia.org> wrote:
> Hi Christian --
> Thanks for the heads-up. I've verbally notified Dario and the Research and Data team. They will follow up with tech-ops.
> -Toby
scrap that, I see there’s already an open ticket, I’ll follow up there.
On Apr 21, 2014, at 5:32 PM, Dario Taraborelli <dario@wikimedia.org> wrote:
> alright, that’s very unfortunate – thanks Christian for catching this. All these slaves are critical for a variety of scripts that populate dashboards and ad-hoc analysis outside of enwiki and dewiki.
> I’ll immediately file an RT ticket.
> Dario
Hi,
all analytics slaves are working again.
On Tue, Apr 22, 2014 at 12:33:09AM +0200, Christian Aistleitner wrote:
> The s7 slave is already unreachable at the time of writing.
s7-analytics-slave.eqiad.wmnet has been updated to point to a different machine, and is usable again.
Thanks to springle for the quick fix!
But Ops also said that anything hammering db1007 too hard will be hastily killed. So please do not run unneeded slow reports on the machine.
Ops are working on getting db68 up again and will then make s7-analytics-slave.eqiad.wmnet point to db68 again.
> I guess due to the pmtpa servers being taken offline. And I expect the remaining ones in the above list to go down soon as well.
That warning turned out to be overcautious. The db68 problem was unrelated and just happened to occur on the day Ops turned off some pmtpa machines.
Ops told us that the analytics slaves are not going away yet because they are on a different floor of the data center.
Have fun, Christian
Thanks for managing this.
Is there any action required from us? Should we fix those CNAMEs or will that be addressed when we move/replace the slaves?
On Tue, Apr 22, 2014 at 1:20 AM, Christian Aistleitner <christian@quelltextlich.at> wrote:
> Hi,
> all analytics slaves are working again.
> On Tue, Apr 22, 2014 at 12:33:09AM +0200, Christian Aistleitner wrote:
> > The s7 slave is already unreachable at the time of writing.
> s7-analytics-slave.eqiad.wmnet has been updated to point to a different machine, and is usable again.
> Thanks to springle for the quick fix!
> But Ops also said that anything hammering db1007 too hard will be hastily killed. So please do not run unneeded slow reports on the machine.
> Ops are working on getting db68 up again and will then make s7-analytics-slave.eqiad.wmnet point to db68 again.
> > I guess due to the pmtpa servers being taken offline. And I expect the remaining ones in the above list to go down soon as well.
> That warning turned out to be overcautious. The db68 problem was unrelated and just happened to occur on the day Ops turned off some pmtpa machines.
> Ops told us that the analytics slaves are not going away yet because they are on a different floor of the data center.
> Have fun, Christian
Hi Toby,
On Tue, Apr 22, 2014 at 04:35:17PM -0700, Toby Negrin wrote:
> Is there any action required from us?
I am not aware of any immediate actions that we need to take.
Springle commented on RT #7330 that the analytics slaves are not going away yet.
> Should we fix those CNAMEs or will that be addressed when we move/replace the slaves?
I have no clue.
Given Springle's comment above (which I read as “we're safe for at least a few days”), I would wait for the dust to settle around the Tampa move and then ask him about the plan going forward.
Do we have any preferences, desires, or expectations that we need to make sure get addressed?
Best regards, Christian