[Also posting to Bugzilla]
According to the ops team, there are a number of separate and
unrelated ops issues that have come up in the last few days:
1) Not all users are experiencing slowness, but a subset of users are.
There's no definite smoking gun, but the most likely cause is ongoing
issues with one of our routers in Tampa. The router will have to be
taken down for maintenance to fix this issue, and in order to perform
this maintenance operation with minimal disruption, we need to have
key ops engineers on standby to deal with any issues that may arise.
My understanding is that the best available maintenance window is
Tuesday next week.
2) There was a software deployment on May 18 which caused an
application server overload; it was reverted the same day.
3) The mobile servers are currently intermittently overloaded and
throwing internal server errors; servers to provide additional
capacity have been racked today.
4) In case you're looking at it, ganglia.wikimedia.org is not
displaying correct server status information (as of yesterday); it's
in the process of being fixed.
We're still in the process of setting up a new primary data center
location in Ashburn, VA, which will give us higher site reliability in
general, and also create the possibility of safe failover in
maintenance or emergency situations.
Thank you for this, Erik. Even this computer-challenged person could
understand what you wrote :-).
Be healthy,
Marc Riddell