See message below about a network outage currently affecting multiple
servers in eqiad. The set of affected servers includes gadolinium
and hafnium, so udp2log-based web request logging and EventLogging-based
metric reporters (e.g., Navigation Timing stats) are affected.
---------- Forwarded message ----------
From: Brandon Black <bblack(a)wikimedia.org>
Date: Sat, Nov 29, 2014 at 9:54 PM
Subject: [Ops] Network outage for rack C4 in eqiad
To: Operations Engineers <ops(a)lists.wikimedia.org>
We've lost network access (but not console access) to all the machines in
eqiad rack C4 as of ~ 03:50 UTC (about 2 hours back from this email). This
is mostly machines in a supporting role; no direct traffic front ends or
app servers, etc. Phabricator is down as a result, as are a few
monitoring-related bits and pieces.
In the logs of asw-c-eqiad, it looks like the virtual chassis member for C4
was "removed" (log paste below). I haven't found any useful remote way to
try to make that virtual chassis member restart yet. I'm not sure if it's
worth waking anyone up in the middle of the night or anything at this
point. Most likely this is going to involve some physical presence (or
remote hands) at eqiad.
-------------------------------
Nov 30 03:50:14 asw-c-eqiad /kernel: peer_inputs:3690 VKS0 closing connection peer type 24 indx 4 err 5
Nov 30 03:50:15 asw-c-eqiad chassism[1093]: CM_CHANGE: Member 1->1, Mode M->M, 1M 8B, GID 0, Master Unchanged, Members Changed
Nov 30 03:50:15 asw-c-eqiad chassism[1093]: CM_CHANGE: 1M 2L 3L 5L 6L 7L 8B
Nov 30 03:50:15 asw-c-eqiad chassism[1093]: CM_CHANGE: Signaling license service
Nov 30 03:50:15 asw-c-eqiad chassisd[1512]: CHASSISD_SNMP_TRAP7: SNMP trap generated: FRU removal (jnxFruContentsIndex 7, jnxFruL1Index 5, jnxFruL2Index 0, jnxFruL3Index 0, jnxFruName FPC: EX4200-48T, 8 POE @ 4/*/*, jnxFruType 3, jnxFruSlot 4)
Nov 30 03:50:15 asw-c-eqiad chassisd[1512]: CHASSISD_FRU_OFFLINE_NOTICE: Taking FPC 4 offline: Removal
Nov 30 03:50:15 asw-c-eqiad chassisd[1512]: CHASSISD_IPC_CONNECTION_DROPPED: Dropped IPC connection for FPC 4
Nov 30 03:50:15 asw-c-eqiad chassisd[1512]: CHASSISD_IFDEV_DETACH_FPC: ifdev_detach_fpc(4)
Nov 30 03:50:15 asw-c-eqiad chassism[1093]: mvlan_member_change_delete: member id 4 (my member id 1, my role 1)
Nov 30 03:50:15 asw-c-eqiad chassism[1093]: mvlan_delete_ifl: IFL resources for bme0.32773 (ifl_index 10) deleted
Nov 30 03:50:15 asw-c-eqiad init: can not access /usr/sbin/smihelperd: No such file or directory
Nov 30 03:50:15 asw-c-eqiad init: subscriber-management-helper (PID 0) started
Nov 30 03:50:16 asw-c-eqiad vccpd[1095]: Member 3, interface vcp-0.32768 came up
Nov 30 03:50:16 asw-c-eqiad vccpd[1095]: Member 7, interface vcp-1.32768 came up
Nov 30 03:50:16 asw-c-eqiad vccpd[1095]: Member 8, interface vcp-1.32768 came up
Nov 30 03:50:16 asw-c-eqiad vccpd[1095]: Member 6, interface vcp-1.32768 went down
Nov 30 03:50:16 asw-c-eqiad vccpd[1095]: Member 2, interface vcp-1.32768 came up
Nov 30 03:50:17 asw-c-eqiad fpc5 MRVL-L2:mrvl_brg_port_stg_entry_unset(),410:l2ifl not found! ifl 350
Nov 30 03:50:17 asw-c-eqiad /kernel: RT_PFE: RT msg op 2 (PREFIX DELETE) failed, err 5 (Invalid)
Nov 30 03:50:17 asw-c-eqiad last message repeated 5 times
Nov 30 03:50:18 asw-c-eqiad fpc5 MRVL-L2:mrvl_brg_port_stg_delete(),652:Port-STG-UnSet failed(Invalid Params:-2)
Nov 30 03:50:18 asw-c-eqiad /kernel: RT_PFE: RT msg op 2 (PREFIX DELETE) failed, err 5 (Invalid)
Nov 30 03:50:18 asw-c-eqiad last message repeated 5 times
Nov 30 03:50:19 asw-c-eqiad fpc5 RT-HAL,rt_entry_delete_msg_proc,3539: l2_halp_vectors->delete failed proto MSTI, len 48 prefix 00350:00254
------------------------------
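For anyone who wants to pull the "member removed" moment out of a longer
switch log, here is a rough Python sketch. It is not part of any existing
tooling: the function name find_fpc_offline and the regular expression are
made up for illustration, and it assumes each syslog entry sits on a single
line, as it would in the switch's own log, in the
CHASSISD_FRU_OFFLINE_NOTICE format seen in the paste.

```python
import re

# Hypothetical pattern, written against the one CHASSISD_FRU_OFFLINE_NOTICE
# line in the paste above; other Junos releases may format it differently.
FRU_OFFLINE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+[\d:]{8})\s+(?P<host>\S+)\s+chassisd\[\d+\]:\s+"
    r"CHASSISD_FRU_OFFLINE_NOTICE:\s+Taking FPC (?P<fpc>\d+) offline:\s+"
    r"(?P<reason>\w+)"
)


def find_fpc_offline(lines):
    """Return (timestamp, host, fpc, reason) for each FPC-offline notice."""
    events = []
    for line in lines:
        match = FRU_OFFLINE.search(line)
        if match:
            events.append(
                (match["ts"], match["host"], int(match["fpc"]), match["reason"])
            )
    return events


# Two entries copied from the paste, each on a single line.
sample = [
    "Nov 30 03:50:15 asw-c-eqiad chassisd[1512]: "
    "CHASSISD_FRU_OFFLINE_NOTICE: Taking FPC 4 offline: Removal",
    "Nov 30 03:50:16 asw-c-eqiad vccpd[1095]: "
    "Member 3, interface vcp-0.32768 came up",
]

for event in find_fpc_offline(sample):
    print(event)
```

Only the first sample entry matches; the vccpd line is ignored because it is
not an FPC-offline notice.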