PowerMedium had some sort of network and/or power outage for about an hour from circa 19:20 UTC. We've been working on getting things back online since power and network became available again.
Will post more about cause when we know more.
We've got a guy at the colo helping out and Kyle's in there also. Unless there are other major problems we should be back online soon.
-- brion vibber (brion @ pobox.com)
Brion Vibber wrote:
PowerMedium had some sort of network and/or power outage for about an hour from circa 19:20 UTC. We've been working on getting things back online since power and network became available again.
Will post more about cause when we know more.
Apparently an 800-amp breaker in the main PDU failed.
The database servers are currently starting up; this takes a little while, but hopefully should be all done within the hour.
-- brion vibber (brion @ pobox.com)
Brion Vibber wrote:
Brion Vibber wrote:
PowerMedium had some sort of network and/or power outage for about an hour from circa 19:20 UTC. We've been working on getting things back online since power and network became available again.
Will post more about cause when we know more.
Apparently an 800-amp breaker in the main PDU failed.
The database servers are currently starting up; this takes a little while, but hopefully should be all done within the hour.
-- brion vibber (brion @ pobox.com)
Don't we have some sort of "power failures corrupt absolutely" message to put up for occasions like this? :P
--- Brion Vibber brion@pobox.com wrote:
Brion Vibber wrote:
PowerMedium had some sort of network and/or power outage for about an hour from circa 19:20 UTC. We've been working on getting things back online since power and network became available again.
Will post more about cause when we know more.
Apparently an 800-amp breaker in the main PDU failed.
The database servers are currently starting up; this takes a little while, but hopefully should be all done within the hour.
I thought each db server was on a UPS? If not, then why not? I also thought we had the capability to go read only from the other server farms when the Tampa farm went down. If not, then why not?
We are a top 20 website that makes more than enough money to hire enough people to make sure services decay gracefully and to make sure we have adequate redundancy. I'm working on the current year budget. Please start thinking about your staffing and equipment needs for the rest of the year.
-- mav
Daniel Mayer wrote:
I thought each db server was on a UPS? If not, then why not?
As I understand it, our whole rack has two UPSs. The breaker that went is the one that feeds power *from* those UPSs *to* the racks.
It's specced way over what should be needed there; according to bw at the colo they've got the company running diagnostics on it as it shouldn't have been able to blow.
I also thought we had the capability to go read only from the other server farms when the Tampa farm went down. If not, then why not?
No, we've not had this capability so far.
We are a top 20 website that makes more than enough money to hire enough people to make sure services decay gracefully and to make sure we have adequate redundancy. I'm working on the current year budget. Please start thinking about your staffing and equipment needs for the rest of the year.
If you've got the budget now we'll be happy to spend it. But for a "top 20 website" we've still got a piddly little budget and a single employee. Sites with comparable traffic sell for many many millions...
-- brion vibber (brion @ pobox.com)
On 4/9/06, Daniel Mayer maveric149@yahoo.com wrote:
I thought each db server was on a UPS? If not, then why not? I also thought we had the capability to go read only from the other server farms when the Tampa farm went down. If not, then why not?
We are a top 20 website that makes more than enough money to hire enough people to make sure services decay gracefully and to make sure we have adequate redundancy. I'm working on the current year budget. Please start thinking about your staffing and equipment needs for the rest of the year.
In any decent facility house power should be more reliable than a bunch of small UPSes stuck in your racks. MTBF on small UPSes is not very impressive.
Gregory Maxwell wrote:
On 4/9/06, Daniel Mayer maveric149@yahoo.com wrote:
I thought each db server was on a UPS? If not, then why not? I also thought we had the capability to go read only from the other server farms when the Tampa farm went down. If not, then why not?
We are a top 20 website that makes more than enough money to hire enough people to make sure services decay gracefully and to make sure we have adequate redundancy. I'm working on the current year budget. Please start thinking about your staffing and equipment needs for the rest of the year.
In any decent facility house power should be more reliable than a bunch of small UPSes stuck in your racks. MTBF on small UPSes is not very impressive.
On the other hand, small UPSes generally do not all fail at once.
In most big facilities, there are generally two power chains, A and B; the only time both should go off is in the case of a dire emergency that requires a manually forced power cut. Were the database servers dual-powered from independent A and B side power supplies, or is there only a single house feed to each DB server?
-- Neil
Brion Vibber wrote:
Apparently an 800-amp breaker in the main PDU failed.
The database servers are currently starting up; this takes a little while, but hopefully should be all done within the hour.
As always this step took a touch longer than hoped. :P
Right now most stuff seems to be back up and running: viewing, editing, images, IRC feeds.
Editing-time outage: 6 hours, 15 minutes.
-- brion vibber (brion @ pobox.com)
Hi!
Right now most stuff seems to be back up and running: viewing, editing, images, IRC feeds.
Editing-time outage: 6 hours, 15 minutes.
Now that we have enough money to perform better, we can start planning.
First of all, for 24/7 operation we'd need to have 9 fully qualified system administrators (for every 8h timeslot with redundancy, leave/sick coverage, etc.) (payroll costs ~100,000$ each).
Of course as these guys would be working on site operations, they'd not have any time for development. So we need >100 guys outsourced to India (payroll costs ~1000$ each).
Sure, as open source technologies are not that suitable for round the clock operation, we might have to start using either Java (enterprise!!!) or .Net/C# (on Microsoft platform, sure, enterprise again!!!).
Then, datacenters. To have a reasonably working read-only stand-by datacenter, we'd have to invest ~250k$ in it (with current load).
We should of course have 3rd party consultants to help this set up (+100,000$ one time, one week gig).
Moreover, we should invest into security measures (as wikis are often hacked). We'd need to screen traffic, ensure highly secure AAA subsystems, uh oh.
I guess other developers (especially with experience in high-availability environments) could add more requirements for fluent operation!
So... let's start rolling!!!
Domas
lol
On 10/04/06, Domas Mituzas midom.lists@gmail.com wrote:
Hi!
Right now most stuff seems to be back up and running: viewing, editing, images, IRC feeds.
Editing-time outage: 6 hours, 15 minutes.
Now that we have enough money to perform better, we can start planning.
First of all, for 24/7 operation we'd need to have 9 fully qualified system administrators (for every 8h timeslot with redundancy, leave/sick coverage, etc.) (payroll costs ~100,000$ each).
Of course as these guys would be working on site operations, they'd not have any time for development. So we need >100 guys outsourced to India (payroll costs ~1000$ each).
Sure, as open source technologies are not that suitable for round the clock operation, we might have to start using either Java (enterprise!!!) or .Net/C# (on Microsoft platform, sure, enterprise again!!!).
Then, datacenters. To have a reasonably working read-only stand-by datacenter, we'd have to invest ~250k$ in it (with current load).
We should of course have 3rd party consultants to help this set up (+100,000$ one time, one week gig).
Moreover, we should invest into security measures (as wikis are often hacked). We'd need to screen traffic, ensure highly secure AAA subsystems, uh oh.
I guess other developers (especially with experience in high-availability environments) could add more requirements for fluent operation!
So... let's start rolling!!!
Domas
-- "Take away their language, destroy their souls." -- Joseph Stalin
On Mon, Apr 10, 2006 at 10:27:23AM +0300, Domas Mituzas wrote:
Now that we have enough money to perform better, we can start planning.
First of all, for 24/7 operation we'd need to have 9 fully qualified system administrators (for every 8h timeslot with redundancy, leave/sick coverage, etc.) (payroll costs ~100,000$ each).
Of course as these guys would be working on site operations, they'd not have any time for development. So we need >100 guys outsourced to India (payroll costs ~1000$ each).
[ ... ]
I guess other developers (especially with experience in high-availability environments) could add more requirements for fluent operation!
Hee.
On a serious note, though, if a backup site is being looked at, we might want to talk to the DirectNIC people in New Orleans...
Cheers, -- jra
On Mon, Apr 10, 2006 at 09:25:35AM -0400, Jay R. Ashworth wrote:
On Mon, Apr 10, 2006 at 10:27:23AM +0300, Domas Mituzas wrote:
Hee.
On a serious note, though, if a backup site is being looked at, we might want to talk to the DirectNIC people in New Orleans...
Why? What's special about them?
Regards,
jens
On Mon, Apr 10, 2006 at 04:37:35PM +0200, Jens Frank wrote:
On a serious note, though, if a backup site is being looked at, we might want to talk to the DirectNIC people in New Orleans...
Why? What's special about them?
Well, they managed to stay on the air during the near total destruction of the New Orleans metropolitan area; that's got to be good for *something*....
Cheers, -- jra
--- Brion Vibber brion@pobox.com wrote:
PowerMedium had some sort of network and/or power outage for about an hour from circa 19:20 UTC. We've been working on getting things back online since power and network became available again.
Will post more about cause when we know more.
We've got a guy at the colo helping out and Kyle's in there also. Unless there are other major problems we should be back online soon.
Argh - For the 20th time over the last 3 years: CAN SOMEBODY PLEASE PUT A DONATION FORM ON THE WIKIDOWN PAGE!!!
Presenting tens of thousands of people with a donation link that WE KNOW WILL NOT WORK WHEN THE MESSAGE IS DISPLAYED not only looks stupid, but is preventing many thousands of potential donations.
Please, do something: we are pissing away thousands of dollars of potential donations and looking like fools in the process.
Also, if we have a donation form, then these outages would be mostly self-correcting since the extra revenue will help us to add more redundancy to the system so we can at least go read only until we are completely back up.
-- mav
Daniel Mayer wrote:
Argh - For the 20th time over the last 3 years: CAN SOMEBODY PLEASE PUT A DONATION FORM ON THE WIKIDOWN PAGE!!!
When I am finally successful at wheedling the secret method of updating the error pages from the unpaid volunteers who run the cache servers, we'll see what we can do.
Also, if we have a donation form, then these outages would be mostly self-correcting since the extra revenue will help us to add more redundancy to the system so we can at least go read only until we are completely back up.
It generally is considered to look very bad, though: "begging for money" or "holding the site ransom".
-- brion vibber (brion @ pobox.com)
wikitech-l@lists.wikimedia.org