See message below about a network outage currently affecting multiple servers in eqiad. The set of affected servers includes gadolinium and hafnium, so udp2log-based web request logging and EventLogging-based metric reporters (e.g., Navigation Timing stats) are affected.
---------- Forwarded message ----------
From: Brandon Black <bblack@wikimedia.org>
Date: Sat, Nov 29, 2014 at 9:54 PM
Subject: [Ops] Network outage for rack C4 in eqiad
To: Operations Engineers <ops@lists.wikimedia.org>
We've lost network access (but not console access) to all the machines in eqiad rack C4 as of ~03:50 UTC (about 2 hours before this email). These are mostly machines in a supporting role; no direct traffic front ends or app servers, etc. Phabricator is down as a result, as are a few monitoring-related bits and pieces.
In the logs of asw-c-eqiad, it looks like the virtual chassis member for C4 was "removed" (log paste below). I haven't found any useful remote way to try to make that virtual chassis member restart yet. I'm not sure if it's worth waking anyone up in the middle of the night or anything at this point. Most likely this is going to involve some physical presence (or remote hands) at eqiad.
-------------------------------
Nov 30 03:50:14 asw-c-eqiad /kernel: peer_inputs:3690 VKS0 closing connection peer type 24 indx 4 err 5
Nov 30 03:50:15 asw-c-eqiad chassism[1093]: CM_CHANGE: Member 1->1, Mode M->M, 1M 8B, GID 0, Master Unchanged, Members Changed
Nov 30 03:50:15 asw-c-eqiad chassism[1093]: CM_CHANGE: 1M 2L 3L 5L 6L 7L 8B
Nov 30 03:50:15 asw-c-eqiad chassism[1093]: CM_CHANGE: Signaling license service
Nov 30 03:50:15 asw-c-eqiad chassisd[1512]: CHASSISD_SNMP_TRAP7: SNMP trap generated: FRU removal (jnxFruContentsIndex 7, jnxFruL1Index 5, jnxFruL2Index 0, jnxFruL3Index 0, jnxFruName FPC: EX4200-48T, 8 POE @ 4/*/*, jnxFruType 3, jnxFruSlot 4)
Nov 30 03:50:15 asw-c-eqiad chassisd[1512]: CHASSISD_FRU_OFFLINE_NOTICE: Taking FPC 4 offline: Removal
Nov 30 03:50:15 asw-c-eqiad chassisd[1512]: CHASSISD_IPC_CONNECTION_DROPPED: Dropped IPC connection for FPC 4
Nov 30 03:50:15 asw-c-eqiad chassisd[1512]: CHASSISD_IFDEV_DETACH_FPC: ifdev_detach_fpc(4)
Nov 30 03:50:15 asw-c-eqiad chassism[1093]: mvlan_member_change_delete: member id 4 (my member id 1, my role 1)
Nov 30 03:50:15 asw-c-eqiad chassism[1093]: mvlan_delete_ifl: IFL resources for bme0.32773 (ifl_index 10) deleted
Nov 30 03:50:15 asw-c-eqiad init: can not access /usr/sbin/smihelperd: No such file or directory
Nov 30 03:50:15 asw-c-eqiad init: subscriber-management-helper (PID 0) started
Nov 30 03:50:16 asw-c-eqiad vccpd[1095]: Member 3, interface vcp-0.32768 came up
Nov 30 03:50:16 asw-c-eqiad vccpd[1095]: Member 7, interface vcp-1.32768 came up
Nov 30 03:50:16 asw-c-eqiad vccpd[1095]: Member 8, interface vcp-1.32768 came up
Nov 30 03:50:16 asw-c-eqiad vccpd[1095]: Member 6, interface vcp-1.32768 went down
Nov 30 03:50:16 asw-c-eqiad vccpd[1095]: Member 2, interface vcp-1.32768 came up
Nov 30 03:50:17 asw-c-eqiad fpc5 MRVL-L2:mrvl_brg_port_stg_entry_unset(),410:l2ifl not found! ifl 350
Nov 30 03:50:17 asw-c-eqiad /kernel: RT_PFE: RT msg op 2 (PREFIX DELETE) failed, err 5 (Invalid)
Nov 30 03:50:17 asw-c-eqiad last message repeated 5 times
Nov 30 03:50:18 asw-c-eqiad fpc5 MRVL-L2:mrvl_brg_port_stg_delete(),652:Port-STG-UnSet failed(Invalid Params:-2)
Nov 30 03:50:18 asw-c-eqiad /kernel: RT_PFE: RT msg op 2 (PREFIX DELETE) failed, err 5 (Invalid)
Nov 30 03:50:18 asw-c-eqiad last message repeated 5 times
Nov 30 03:50:19 asw-c-eqiad fpc5 RT-HAL,rt_entry_delete_msg_proc,3539: l2_halp_vectors->delete failed proto MSTI, len 48 prefix 00350:00254
------------------------------
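For anyone who wants to spot this kind of event in a saved copy of the switch log, here is a minimal Python sketch. It only assumes the line format seen in the paste above (syslog timestamp, hostname, `chassisd[pid]:`, then `CHASSISD_FRU_OFFLINE_NOTICE`); the function name `find_fpc_offline` and the `sample` list are made up for illustration and are not part of any existing tooling.

```python
import re

# Pattern for the chassisd notice that a virtual chassis member (FPC) is being
# taken offline. The layout is assumed from the log lines quoted in this email.
FRU_OFFLINE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+(?P<host>\S+)\s+chassisd\[\d+\]:\s+"
    r"CHASSISD_FRU_OFFLINE_NOTICE: Taking FPC (?P<fpc>\d+) offline: (?P<reason>.+)$"
)


def find_fpc_offline(lines):
    """Return (timestamp, host, fpc_number, reason) for each FPC-offline notice."""
    events = []
    for line in lines:
        match = FRU_OFFLINE.match(line.strip())
        if match:
            events.append(
                (match["ts"], match["host"], int(match["fpc"]), match["reason"])
            )
    return events


# Two lines copied from the paste above: one notice, one unrelated vccpd message.
sample = [
    "Nov 30 03:50:15 asw-c-eqiad chassisd[1512]: "
    "CHASSISD_FRU_OFFLINE_NOTICE: Taking FPC 4 offline: Removal",
    "Nov 30 03:50:16 asw-c-eqiad vccpd[1095]: "
    "Member 6, interface vcp-1.32768 went down",
]

for ts, host, fpc, reason in find_fpc_offline(sample):
    print(f"{ts} {host}: FPC {fpc} offline ({reason})")
```

Run against the full paste, this should flag only the single notice for FPC 4, which matches the member that dropped out of the virtual chassis.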