See message below about a network outage currently affecting multiple servers in eqiad. The set of affected servers includes gadolinium and hafnium, so udp2log-based web request logging and EventLogging-based metric reporters (e.g., Navigation Timing stats) are affected.

---------- Forwarded message ----------
From: Brandon Black <bblack@wikimedia.org>
Date: Sat, Nov 29, 2014 at 9:54 PM
Subject: [Ops] Network outage for rack C4 in eqiad
To: Operations Engineers <ops@lists.wikimedia.org>



We've lost network access (but not console access) to all the machines in eqiad rack C4 as of ~03:50 UTC (about 2 hours before this email).  These are mostly machines in a supporting role: no direct traffic front ends, app servers, etc.  Phabricator is down as a result, as are a few monitoring-related bits and pieces.

In the logs of asw-c-eqiad, it looks like the virtual chassis member for C4 was "removed" (log paste below).  I haven't yet found a useful way to restart that virtual chassis member remotely.  I'm not sure it's worth waking anyone up in the middle of the night at this point.  Most likely this is going to involve some physical presence (or remote hands) at eqiad.

-------------------------------
Nov 30 03:50:14  asw-c-eqiad /kernel: peer_inputs:3690 VKS0 closing connection peer type 24 indx 4 err 5
Nov 30 03:50:15  asw-c-eqiad chassism[1093]: CM_CHANGE: Member 1->1, Mode M->M, 1M 8B, GID 0, Master Unchanged, Members Changed
Nov 30 03:50:15  asw-c-eqiad chassism[1093]: CM_CHANGE: 1M 2L 3L 5L 6L 7L 8B
Nov 30 03:50:15  asw-c-eqiad chassism[1093]: CM_CHANGE: Signaling license service
Nov 30 03:50:15  asw-c-eqiad chassisd[1512]: CHASSISD_SNMP_TRAP7: SNMP trap generated: FRU removal (jnxFruContentsIndex 7, jnxFruL1Index 5, jnxFruL2Index 0, jnxFruL3Index 0, jnxFruName FPC: EX4200-48T, 8 POE @ 4/*/*, jnxFruType 3, jnxFruSlot 4)
Nov 30 03:50:15  asw-c-eqiad chassisd[1512]: CHASSISD_FRU_OFFLINE_NOTICE: Taking FPC 4 offline: Removal
Nov 30 03:50:15  asw-c-eqiad chassisd[1512]: CHASSISD_IPC_CONNECTION_DROPPED: Dropped IPC connection for FPC 4
Nov 30 03:50:15  asw-c-eqiad chassisd[1512]: CHASSISD_IFDEV_DETACH_FPC: ifdev_detach_fpc(4)
Nov 30 03:50:15  asw-c-eqiad chassism[1093]: mvlan_member_change_delete: member id 4 (my member id 1, my role 1)
Nov 30 03:50:15  asw-c-eqiad chassism[1093]: mvlan_delete_ifl: IFL resources for bme0.32773 (ifl_index 10) deleted
Nov 30 03:50:15  asw-c-eqiad init: can not access /usr/sbin/smihelperd: No such file or directory
Nov 30 03:50:15  asw-c-eqiad init: subscriber-management-helper (PID 0) started
Nov 30 03:50:16  asw-c-eqiad vccpd[1095]: Member 3, interface vcp-0.32768 came up
Nov 30 03:50:16  asw-c-eqiad vccpd[1095]: Member 7, interface vcp-1.32768 came up
Nov 30 03:50:16  asw-c-eqiad vccpd[1095]: Member 8, interface vcp-1.32768 came up
Nov 30 03:50:16  asw-c-eqiad vccpd[1095]: Member 6, interface vcp-1.32768 went down
Nov 30 03:50:16  asw-c-eqiad vccpd[1095]: Member 2, interface vcp-1.32768 came up
Nov 30 03:50:17  asw-c-eqiad fpc5 MRVL-L2:mrvl_brg_port_stg_entry_unset(),410:l2ifl not found! ifl 350
Nov 30 03:50:17  asw-c-eqiad /kernel: RT_PFE: RT msg op 2 (PREFIX DELETE) failed, err 5 (Invalid)
Nov 30 03:50:17  asw-c-eqiad last message repeated 5 times
Nov 30 03:50:18  asw-c-eqiad fpc5 MRVL-L2:mrvl_brg_port_stg_delete(),652:Port-STG-UnSet failed(Invalid Params:-2)
Nov 30 03:50:18  asw-c-eqiad /kernel: RT_PFE: RT msg op 2 (PREFIX DELETE) failed, err 5 (Invalid)
Nov 30 03:50:18  asw-c-eqiad last message repeated 5 times
Nov 30 03:50:19  asw-c-eqiad fpc5 RT-HAL,rt_entry_delete_msg_proc,3539: l2_halp_vectors->delete failed proto MSTI, len 48 prefix 00350:00254
------------------------------
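
For anyone who wants to scan a longer log for the same kind of event, here is a minimal sketch in Python that pulls out the CHASSISD_FRU_OFFLINE_NOTICE lines.  The function name and the regular expression are illustrative assumptions, not an existing tool; the pattern is based only on the line format seen in the paste above and may need adjusting for other message variants.

```python
import re

# Assumed line format, taken from the paste above, e.g.:
# "Nov 30 03:50:15  asw-c-eqiad chassisd[1512]: CHASSISD_FRU_OFFLINE_NOTICE: Taking FPC 4 offline: Removal"
FRU_OFFLINE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+(?P<host>\S+)\s+chassisd\[\d+\]:\s+"
    r"CHASSISD_FRU_OFFLINE_NOTICE: Taking FPC (?P<fpc>\d+) offline: (?P<reason>.+)$"
)


def find_fpc_offline(lines):
    """Return (timestamp, host, fpc_number, reason) for each FPC offline notice.

    Hypothetical helper: lines that do not match the assumed format are skipped.
    """
    events = []
    for line in lines:
        m = FRU_OFFLINE.match(line.strip())
        if m:
            events.append(
                (m.group("ts"), m.group("host"), int(m.group("fpc")), m.group("reason"))
            )
    return events


# Two lines copied from the paste above; only the second is an offline notice.
sample = [
    "Nov 30 03:50:15  asw-c-eqiad chassism[1093]: CM_CHANGE: 1M 2L 3L 5L 6L 7L 8B",
    "Nov 30 03:50:15  asw-c-eqiad chassisd[1512]: CHASSISD_FRU_OFFLINE_NOTICE: Taking FPC 4 offline: Removal",
]

for event in find_fpc_offline(sample):
    print(event)
```

In the paste above, the FPC number in that notice matches the member id in the mvlan_member_change_delete line (both are 4), which is what points at the C4 member.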

_______________________________________________
Ops mailing list
Ops@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/ops