Hi Everyone -
This issue appears to be patched up. Please let me know immediately if you see any more network issues.
Longer explanation - the root cause of the issues we saw today was a supposedly "fixed" router bug (our software version should not have been affected). When a firewall filter rejects packets, each rejection generates an ICMP reject notice. Under a high enough rate of rejections, the routing engine receives more of these requests than it can process and "chokes" on its backlog. That backlog caused packets destined for the routing engine to be dropped, which in turn broke VRRP, BFD, and BGP, as all three stopped processing. For a currently unknown reason, OSPF was unaffected.
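To make the failure mode concrete, here is a toy Python model of it. This is a sketch, not router code: the queue depth, drain rate, and hold-timer values are all invented for illustration. The mechanism it models is protocol keepalives (VRRP/BFD/BGP hellos) sharing one bounded routing-engine input queue with the ICMP reject work, so a large enough reject flood tail-drops the keepalives and the hold timers expire:

# Toy model of the control-plane overload, NOT router code: queue size,
# drain rate, and timer values are invented for illustration only.
from collections import deque

QUEUE_DEPTH = 100     # bounded routing-engine input queue (assumed)
DRAIN_PER_TICK = 50   # packets the engine can process per tick (assumed)
HOLD_TICKS = 3        # ticks a protocol survives without a keepalive (assumed)

def simulate(icmp_per_tick, ticks=20):
    queue = deque()
    last_seen = {"VRRP": 0, "BFD": 0, "BGP": 0}
    for t in range(1, ticks + 1):
        # The reject flood hits the queue ahead of this tick's keepalives.
        arrivals = ["ICMP"] * icmp_per_tick + list(last_seen)
        for pkt in arrivals:
            if len(queue) < QUEUE_DEPTH:
                queue.append(pkt)   # otherwise: tail drop
        # Drain what the engine can handle this tick.
        for _ in range(min(DRAIN_PER_TICK, len(queue))):
            pkt = queue.popleft()
            if pkt in last_seen:
                last_seen[pkt] = t
        # Any protocol that hasn't had a keepalive processed recently fails.
        for proto, seen in last_seen.items():
            if t - seen >= HOLD_TICKS:
                print(f"tick {t}: {proto} hold timer expired, session drops")
                return
    print(f"{icmp_per_tick} rejects/tick: everything stays up")

simulate(icmp_per_tick=40)   # engine keeps up; keepalives get processed
simulate(icmp_per_tick=500)  # flood; keepalives tail-drop, protocols fail

This is deliberately a cartoon -- it does not, for instance, explain why OSPF survived -- but it matches the shape of what we saw.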
After correcting this, for an unknown reason, one vlan was not processing packets destined for the routing engine, while the other vlans were processing these packets properly. As a result, both of our main routers on that vlan claimed VRRP mastership; in effect, two routers each claimed to be the default gateway for the subnet that contains the LVS servers. Even after disabling VRRP, the router still was not passing traffic destined for this vlan. Turning the vlan down and back up, then adding and removing an arp policer (yes, turning it off and on again), fixed the situation. This vlan issue caused a public-facing outage.
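For anyone who wants to watch for the two-masters symptom from a host on the affected vlan, here is a rough sketch using Python and scapy. This is an assumption about how one could monitor it, not something we run in production, and the interface name "eth0" is a placeholder:

# Rough split-brain watchdog (pip install scapy, run as root).
# On a healthy segment, exactly one router should be sourcing VRRP
# advertisements for a given VRID; a second source MAC advertising the
# same VRID is the two-masters symptom described above.
from collections import defaultdict
from scapy.all import sniff
from scapy.layers.l2 import Ether
from scapy.layers.vrrp import VRRP

advertisers = defaultdict(set)  # vrid -> source MACs seen advertising

def check(pkt):
    if VRRP in pkt and Ether in pkt:
        vrid = pkt[VRRP].vrid
        advertisers[vrid].add(pkt[Ether].src)
        if len(advertisers[vrid]) > 1:
            print(f"WARNING: {len(advertisers[vrid])} masters for VRID "
                  f"{vrid}: {sorted(advertisers[vrid])}")

# VRRP advertisements are IP protocol 112, multicast to 224.0.0.18.
sniff(iface="eth0", filter="ip proto 112", prn=check, store=False)

On a healthy segment the warning should never fire, since only the current master sources advertisements for a VRID.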
The current status is that everything is working and cr2-pmtpa is the VRRP master for all of Tampa. We were lucky that this bug hit cr1-sdtpa much harder than cr2-pmtpa. Eqiad was not affected; while we cannot yet say definitively why, I believe it is due to the more powerful routing engines and more robust network design of the eqiad datacenter and its routers. Software upgrades and configuration changes should fix this issue in Tampa. Another possible fix would be hardware upgrades of the core routers; however, that may be prohibitively expensive and would require some downtime for important machines in pmtpa.
Leslie
On Mon, Jul 2, 2012 at 3:03 PM, Ct Woo <ctwoo@wikimedia.org> wrote:
All,
The Technical Operations team noticed abnormal network packet losses sometime after yesterday's 'leap second' switch (midnight UTC). While this does not seem to impact site availability at the moment, it is a concern. We are still not sure whether it is even related to the 'leap second' switch.
Leslie has opened a ticket with our network equipment provider, and she and Mark have been working with them since this morning to pinpoint the problem. It is possible that the troubleshooting process will introduce some latency or other issues.
If you do experience anything abnormal, please let us know (email ops@wikimedia.org or find us in the #wikimedia-operations IRC channel).
Thanks,
CT