I saw a WMF tweet about a site outage (network?) around 9:30am Pacific time. By the time I could check, things seemed OK on en.
Any news on root cause?
George William Herbert
Hi,
On Thursday, 5 February 2015, 09:58:01, George Herbert wrote:
> I saw a WMF tweet about a site outage (network?) around 9:30am Pacific time. By the time I could check, things seemed OK on en.
Sites are mostly back up but there are still issues with login, so the Ops team hasn't had time to write a postmortem yet.
Hi all,
We've indeed had a total site outage for roughly 30 minutes. We're still collecting all data, but we've tracked down the cause to multiple cascading issues, including loss of power to a critical SPOF network switch and HHVM MediaWiki application servers getting blocked due to multiple suboptimal timeout settings. We'll post a full incident report soon and work to correct the underlying issues as soon as possible.
Our apologies,
On Thu, Feb 5, 2015 at 7:03 PM, Guillaume Paumier gpaumier@wikimedia.org wrote:
> Hi,
> Sites are mostly back up but there are still issues with login, so the Ops team hasn't had time to write a postmortem yet.
> -- Guillaume Paumier
Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
The incident report is now posted on wikitech:
https://wikitech.wikimedia.org/wiki/Incident_documentation/20150205-SiteOuta...
On Thu, Feb 5, 2015 at 7:57 PM, Mark Bergsma mark@wikimedia.org wrote:
> Hi all,
> We've indeed had a total site outage for roughly 30 minutes. We're still collecting all data, but we've tracked down the cause to multiple cascading issues, including loss of power to a critical SPOF network switch and HHVM MediaWiki application servers getting blocked due to multiple suboptimal timeout settings. We'll post a full incident report soon and work to correct the underlying issues as soon as possible.
> Our apologies,
--
Mark Bergsma mark@wikimedia.org
Lead Operations Architect
Director of Technical Operations
Wikimedia Foundation
On Thu, 2015-02-05 at 09:58 -0800, George Herbert wrote:
> Any news on root cause?
All work in progress: documentation is linked from the description in https://phabricator.wikimedia.org/tag/incident-20150205-siteoutage/ and that project also lists potential follow-up actions as tasks.
andre
Interesting. I thought that WMF had full redundancy among at least two physical sites for tech infrastructure, in case any location went completely offline, such as if there is a fire or flood. Is this not the case, and if so, is full redundancy capability planned?
Pine
*This is an Encyclopedia* https://www.wikipedia.org/
*One gateway to the wide garden of knowledge, where lies
The deep rock of our past, in which we must delve
The well of our future,
The clear water we must leave untainted for those who come after us,
The fertile earth, in which truth may grow in bright places, tended by many hands,
And the broad fall of sunshine, warming our first steps toward knowing how much we do not know.*
*—Catherine Munro*
On Thu, Feb 5, 2015 at 12:23 PM, Andre Klapper aklapper@wikimedia.org wrote:
> All work in progress: documentation is linked from the description in https://phabricator.wikimedia.org/tag/incident-20150205-siteoutage/ and that project also lists potential follow-up actions as tasks.
> andre
> Andre Klapper | Wikimedia Bugwrangler http://blogs.gnome.org/aklapper/
<quote name="Pine W" date="2015-02-05" time="16:32:15 -0800">
> Interesting. I thought that WMF had full redundancy among at least two physical sites for tech infrastructure, in case any location went completely offline, such as if there is a fire or flood. Is this not the case, and if so, is full redundancy capability planned?
We only have one primary datacenter right now ("EQIAD" in Ashburn, VA); the others ("ULSFO" in SF, "ESAMS" in Amsterdam) are just cache centers.
A new datacenter, CODFW (in Dallas, TX), is coming online and will be a full-fledged DC alongside EQIAD. It is not yet operational (machines are still being installed, etc.), and once it is, there will still be work to do to make the MediaWiki software fully multi-datacenter aware.
Greg
I see, thanks.
Pine
On Thu, Feb 5, 2015 at 4:42 PM, Greg Grossmeier greg@wikimedia.org wrote:
> We only have one primary datacenter right now ("EQIAD" in Ashburn, VA); the others ("ULSFO" in SF, "ESAMS" in Amsterdam) are just cache centers.
> A new datacenter, CODFW (in Dallas, TX), is coming online and will be a full-fledged DC alongside EQIAD. It is not yet operational (machines are still being installed, etc.), and once it is, there will still be work to do to make the MediaWiki software fully multi-datacenter aware.
> Greg
> --
> | Greg Grossmeier            GPG: B2FA 27B1 F7EB D327 6B8E |
> | identi.ca: @greg                A18D 1138 8E47 FAC8 1C7D |