Hi Everyone -
This issue appears to be patched up. Please let me know immediately if you see any more network issues.
Longer explanation - the root cause of the issues we saw today was a supposedly "fixed" router bug (our software version should not have been affected). When a firewall filter rejects packets, each rejection generates an ICMP reject notice. Under a high enough rate of rejections, the routing engine receives more of these requests than it can process and "chokes" on its backlog. That backlog caused packets destined for the routing engine to be dropped, which in turn broke VRRP, BFD, and BGP, as all three stopped processing. For a currently unknown reason, OSPF was unaffected.
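To make the failure mode concrete, here is a toy Python model of it. This is a sketch, not router code: the queue depth, drain rate, and hold-timer values are all invented for illustration. The mechanism it models is protocol keepalives (VRRP/BFD/BGP hellos) sharing one bounded routing-engine input queue with the ICMP reject work, so a large enough reject flood tail-drops the keepalives and the hold timers expire:

# Toy model of the control-plane overload, NOT router code: queue size,
# drain rate, and timer values are invented for illustration only.
from collections import deque

QUEUE_DEPTH = 100     # bounded routing-engine input queue (assumed)
DRAIN_PER_TICK = 50   # packets the engine can process per tick (assumed)
HOLD_TICKS = 3        # ticks a protocol survives without a keepalive (assumed)

def simulate(icmp_per_tick, ticks=20):
    queue = deque()
    last_seen = {"VRRP": 0, "BFD": 0, "BGP": 0}
    for t in range(1, ticks + 1):
        # The reject flood hits the queue ahead of this tick's keepalives.
        arrivals = ["ICMP"] * icmp_per_tick + list(last_seen)
        for pkt in arrivals:
            if len(queue) < QUEUE_DEPTH:
                queue.append(pkt)   # otherwise: tail drop
        # Drain what the engine can handle this tick.
        for _ in range(min(DRAIN_PER_TICK, len(queue))):
            pkt = queue.popleft()
            if pkt in last_seen:
                last_seen[pkt] = t
        # Any protocol that hasn't had a keepalive processed recently fails.
        for proto, seen in last_seen.items():
            if t - seen >= HOLD_TICKS:
                print(f"tick {t}: {proto} hold timer expired, session drops")
                return
    print(f"{icmp_per_tick} rejects/tick: everything stays up")

simulate(icmp_per_tick=40)   # engine keeps up; keepalives get processed
simulate(icmp_per_tick=500)  # flood; keepalives tail-drop, protocols fail

This is deliberately a cartoon -- it does not, for instance, explain why OSPF survived -- but it matches the shape of what we saw.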
After correcting this, for an unknown reason, one vlan was not processing packets destined for the routing engine, while the other vlans were processing these packets properly. As a result, both of our main routers on that vlan claimed VRRP mastership; in effect, two routers each claimed to be the default gateway for the subnet that contains the LVS servers. Even after disabling VRRP, the router still was not passing traffic destined for this vlan. Turning the vlan down and back up, then adding and removing an arp policer (yes, turning it off and on again), fixed the situation. This vlan issue caused a public-facing outage.
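For anyone who wants to watch for the two-masters symptom from a host on the affected vlan, here is a rough sketch using Python and scapy. This is an assumption about how one could monitor it, not something we run in production, and the interface name "eth0" is a placeholder:

# Rough split-brain watchdog (pip install scapy, run as root).
# On a healthy segment, exactly one router should be sourcing VRRP
# advertisements for a given VRID; a second source MAC advertising the
# same VRID is the two-masters symptom described above.
from collections import defaultdict
from scapy.all import sniff
from scapy.layers.l2 import Ether
from scapy.layers.vrrp import VRRP

advertisers = defaultdict(set)  # vrid -> source MACs seen advertising

def check(pkt):
    if VRRP in pkt and Ether in pkt:
        vrid = pkt[VRRP].vrid
        advertisers[vrid].add(pkt[Ether].src)
        if len(advertisers[vrid]) > 1:
            print(f"WARNING: {len(advertisers[vrid])} masters for VRID "
                  f"{vrid}: {sorted(advertisers[vrid])}")

# VRRP advertisements are IP protocol 112, multicast to 224.0.0.18.
sniff(iface="eth0", filter="ip proto 112", prn=check, store=False)

On a healthy segment the warning should never fire, since only the current master sources advertisements for a VRID.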
The current status is that everything is working and cr2-pmtpa is the VRRP master for all of Tampa. We were lucky that this bug hit cr1-sdtpa much harder than cr2-pmtpa. Eqiad was not affected; while we cannot yet say definitively why, I believe it is due to the more powerful routing engines and more robust network design of the eqiad datacenter and its routers. Software upgrades and configuration changes should fix this issue in Tampa. Another possible fix would be hardware upgrades of the core routers; however, that may be prohibitively expensive and would require some downtime for important machines in pmtpa.
Leslie
On Mon, Jul 2, 2012 at 3:03 PM, Ct Woo <ctwoo@wikimedia.org> wrote:
All,
The Technical Operations team noticed abnormal network packet losses sometime after yesterday's 'leap second' switch (midnight UTC). While this does not seem to impact site availability at the moment, it is a concern. We are still not sure whether it is even related to the 'leap second' switch.
Leslie has opened a ticket with our network equipment provider, and she and Mark have been working with them since this morning to pinpoint the problem. It is possible that the troubleshooting process will introduce some latency or other issues.
If you do experience anything abnormal, please let us know (email ops@wikimedia.org or find us in the #wikimedia-operations IRC channel).
Thanks,
CT