Outage

Brion Vibber

10 Apr 2006

1:03 a.m.

PowerMedium had some sort of network and/or power outage for about an hour from circa 19:20 UTC. We've been working on getting things back online since power and network became available again.

Will post more about cause when we know more.

We've got a guy at the colo helping out and Kyle's in there also. Unless there are other major problems we should be back online soon.

-- brion vibber (brion @ pobox.com)

Brion Vibber

10 Apr

1:27 a.m.

Brion Vibber wrote:

...

PowerMedium had some sort of network and/or power outage for about an hour from circa 19:20 UTC. We've been working on getting things back online since power and network became available again.

Will post more about cause when we know more.

Apparently an 800-amp breaker in the main PDU failed.

The database servers are currently starting up; this takes a little while, but hopefully should be all done within the hour.

-- brion vibber (brion @ pobox.com)

Minh Nguyen

2 a.m.

Brion Vibber wrote:

...

Brion Vibber wrote:

...
PowerMedium had some sort of network and/or power outage for about an hour from circa 19:20 UTC. We've been working on getting things back online since power and network became available again.

Will post more about cause when we know more.

Apparently an 800-amp breaker in the main PDU failed.

The database servers are currently starting up; this takes a little while, but hopefully should be all done within the hour.

-- brion vibber (brion @ pobox.com)

Don't we have some sort of "power failures corrupt absolutely" message to put up for occasions like this? :P

-- Minh Nguyen mxn@zoomtown.com AIM: trycom2000; Jabber: mxn@myjabber.net; Blog: http://mxn.f2o.org/

Daniel Mayer

3:10 a.m.

--- Brion Vibber brion@pobox.com wrote:

...

Brion Vibber wrote:

...
PowerMedium had some sort of network and/or power outage for about an hour from circa 19:20 UTC. We've been working on getting things back online since power and network became available again.

Will post more about cause when we know more.

Apparently an 800-amp breaker in the main PDU failed.

The database servers are currently starting up; this takes a little while, but hopefully should be all done within the hour.

I thought each db server was on a UPS? If not, then why not? I also thought we had the capability to go read only from the other server farms when the Tampa farm went down. If not, then why not?

We are a top 20 website that makes more than enough money to hire enough people to make sure services decay gracefully and to make sure we have adequate redundancy. I'm working on the current year budget. Please start thinking about your staffing and equipment needs for the rest of the year.

-- mav

Brion Vibber

3:25 a.m.

Daniel Mayer wrote:

...

I thought each db server was on a UPS? If not, then why not?

As I understand it, our whole rack has two UPSs. The breaker that went is the one that feeds power *from* those UPSs *to* the racks.

It's specced way over what should be needed there; according to bw at the colo they've got the company running diagnostics on it as it shouldn't have been able to blow.

...

I also thought we had the capability to go read only from the other server farms when the Tampa farm went down. If not, then why not?

No, we've not had this capability so far.

...

We are a top 20 website that makes more than enough money to hire enough people to make sure services decay gracefully and to make sure we have adequate redundancy. I'm working on the current year budget. Please start thinking about your staffing and equipment needs for the rest of the year.

If you've got the budget now we'll be happy to spend it. But for a "top 20 website" we've still got a piddly little budget and a single employee. Sites with comparable traffic sell for many many millions...

-- brion vibber (brion @ pobox.com)

Gregory Maxwell

3:28 a.m.

On 4/9/06, Daniel Mayer maveric149@yahoo.com wrote:

...

I thought each db server was on a UPS? If not, then why not? I also thought we had the capability to go read only from the other server farms when the Tampa farm went down. If not, then why not?

We are a top 20 website that makes more than enough money to hire enough people to make sure services decay gracefully and to make sure we have adequate redundancy. I'm working on the current year budget. Please start thinking about your staffing and equipment needs for the rest of the year.

In any decent facility house power should be more reliable than a bunch of small UPSes stuck in your racks. MTBF on small UPSes is not very impressive.

Neil Harris

3:59 a.m.

Gregory Maxwell wrote:

...

On 4/9/06, Daniel Mayer maveric149@yahoo.com wrote:

...
I thought each db server was on a UPS? If not, then why not? I also thought we had the capability to go read only from the other server farms when the Tampa farm went down. If not, then why not?

We are a top 20 website that makes more than enough money to hire enough people to make sure services decay gracefully and to make sure we have adequate redundancy. I'm working on the current year budget. Please start thinking about your staffing and equipment needs for the rest of the year.

In any decent facility house power should be more reliable than a bunch of small UPSes stuck in your racks. MTBF on small UPSes is not very impressive.

On the other hand, small UPS's generally do not all fail at once.

In most big facilities, there are generally two power chains, A and B; the only time both should go off is in the case of a dire emergency that requires a manually forced power cut. Were the database servers dual-powered from independent A and B side power supplies, or is there only a single house feed to each DB server?

-- Neil

Brion Vibber

4:48 a.m.

Brion Vibber wrote:

...

Apparently an 800-amp breaker in the main PDU failed.

The database servers are currently starting up; this takes a little while, but hopefully should be all done within the hour.

As always this step took a touch longer than hoped. :P

Right now most stuff seems to be back up and running: viewing, editing, images, IRC feeds.

Editing-time outage: 6 hours, 15 minutes.

-- brion vibber (brion @ pobox.com)
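As a rough cross-check of the figures in this thread, the sketch below adds the reported editing outage (6 hours, 15 minutes) to the reported start time (circa 19:20 UTC) to estimate when editing came back. The calendar date of 9 April 2006 is an assumption inferred from the dates on the posts, and the names `OUTAGE_START`, `EDITING_OUTAGE`, and `editing_restored` are hypothetical, used only for illustration.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the outage began on 9 April 2006. The thread itself only
# says "circa 19:20 UTC", so the start time is approximate as well.
OUTAGE_START = datetime(2006, 4, 9, 19, 20, tzinfo=timezone.utc)

# Editing-time outage as reported in the thread: 6 hours, 15 minutes.
EDITING_OUTAGE = timedelta(hours=6, minutes=15)


def editing_restored(start=OUTAGE_START, duration=EDITING_OUTAGE):
    """Return the approximate time at which editing was available again."""
    return start + duration


if __name__ == "__main__":
    print(editing_restored().isoformat())
```

Under those assumptions the result falls at about 01:35 UTC on 10 April, roughly five hours after power and network came back.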

Domas Mituzas

10:27 a.m.

Hi!

...

Right now most stuff seems to be back up and running: viewing, editing, images, IRC feeds.

Editing-time outage: 6 hours, 15 minutes.

Now as we have enough of money to perform better, we can start planning.

First of all, for 24/7 operation we'd need to have 9 fully qualified system administrators (for every 8h timeslot with redundancy, leave/sick coverage, etc) (payroll costs ~100,000$ each).

Of course as these guys would be working on site operations, they'd not have any time for development. So we need >100 guys outsourced to India (payroll costs ~1000$ each).

Sure, as open source technologies are not that suitable for round the clock operation, we might have to start using either Java (enterprise!!!) or .Net/C# (on Microsoft platform, sure, enterprise again!!!).

Then, datacenters. To have a reasonably working read-only stand-by datacenter, we'd have to invest ~250k$ in it (with current load).

We should of course have 3rd party consultants to help this set up (+100,000$ one time, one week gig).

Moreover, we should invest into security measures (as wikis are often hacked). We'd need to screen traffic, ensure highly secure AAA subsystems, uh oh.

I guess other developers (especially with experience in high-availability environments) could add more requirements for fluent operation!

So... let's start rolling!!!

Domas

Mark Williamson

10:33 a.m.

lol

On 10/04/06, Domas Mituzas midom.lists@gmail.com wrote:

...

Hi!

...
Right now most stuff seems to be back up and running: viewing, editing, images, IRC feeds.

Editing-time outage: 6 hours, 15 minutes.

Now as we have enough of money to perform better, we can start planning.

First of all, for 24/7 operation we'd need to have 9 fully qualified system administrators (for every 8h timeslot with redundancy, leave/sick coverage, etc) (payroll costs ~100,000$ each).

Of course as these guys would be working on site operations, they'd not have any time for development. So we need >100 guys outsourced to India (payroll costs ~1000$ each).

Sure, as open source technologies are not that suitable for round the clock operation, we might have to start using either Java (enterprise!!!) or .Net/C# (on Microsoft platform, sure, enterprise again!!!).

Then, datacenters. To have a reasonably working read-only stand-by datacenter, we'd have to invest ~250k$ in it (with current load).

We should of course have 3rd party consultants to help this set up (+100,000$ one time, one week gig).

Moreover, we should invest into security measures (as wikis are often hacked). We'd need to screen traffic, ensure highly secure AAA subsystems, uh oh.

I guess other developers (especially with experience in high-availability environments) could add more requirements for fluent operation!

So... let's start rolling!!!

Domas

-- "Take away their language, destroy their souls." -- Joseph Stalin

Jay R. Ashworth

4:25 p.m.

On Mon, Apr 10, 2006 at 10:27:23AM +0300, Domas Mituzas wrote:

...

Now as we have enough of money to perform better, we can start planning.

First of all, for 24/7 operation we'd need to have 9 fully qualified system administrators (for every 8h timeslot with redundancy, leave/sick coverage, etc) (payroll costs ~100,000$ each).

Of course as these guys would be working on site operations, they'd not have any time for development. So we need >100 guys outsourced to India (payroll costs ~1000$ each).

[ ... ]

...

I guess other developers (especially with experience in high-availability environments) could add more requirements for fluent operation!

Hee.

On a serious note, though, if a backup site is being looked at, we might want to talk to the DirectNIC people in New Orleans...

Cheers, -- jra

-- Jay R. Ashworth jra@baylink.com Designer Baylink RFC 2100 Ashworth & Associates The Things I Think '87 e24 St Petersburg FL USA http://baylink.pitas.com +1 727 647 1274 A: Because it messes up the order in which people normally read text. Q: Why is top-posting such a bad thing? A: Top-posting. Q: What is the most annoying thing on Usenet and in e-mail?

jf＠mormo.org

5:37 p.m.

On Mon, Apr 10, 2006 at 09:25:35AM -0400, Jay R. Ashworth wrote:

...

On Mon, Apr 10, 2006 at 10:27:23AM +0300, Domas Mituzas wrote:

Hee.

On a serious note, though, if a backup site is being looked at, we might want to talk to the DirectNIC people in New Orleans...

Why? What's special about them?

Regards,

jens

Jay R. Ashworth

5:38 p.m.

On Mon, Apr 10, 2006 at 04:37:35PM +0200, Jens Frank wrote:

...

...
On a serious note, though, if a backup site is being looked at, we might want to talk to the DirectNIC people in New Orleans...

Why? What's special about them?

Well, they managed to stay on the air during the near total destruction of the New Orleans metropolitan area; that's got to be good for *something*....

Cheers, -- jra

Daniel Mayer

2:52 a.m.

New subject: Where the hell is the donation form? (was Re: Outage)

--- Brion Vibber brion@pobox.com wrote:

...

PowerMedium had some sort of network and/or power outage for about an hour from circa 19:20 UTC. We've been working on getting things back online since power and network became available again.

Will post more about cause when we know more.

We've got a guy at the colo helping out and Kyle's in there also. Unless there are other major problems we should be back online soon.

Argh - For the 20th time over the last 3 years: CAN SOMEBODY PLEASE PUT A DONATION FORM ON THE WIKIDOWN PAGE!!!

Presenting tens of thousands of people with a donation link that WE KNOW WILL NOT WORK WHEN THE MESSAGE IS DISPLAYED not only looks stupid, but is preventing many thousands of potential donations.

Please, do something: we are pissing away thousands of dollars of potential donations and looking like fools in the process.

Also, if we have a donation form, then these outages would be mostly self-correcting since the extra revenue will help us to add more redundancy to the system so we can at least go read only until we are completely back up.

-- mav

Brion Vibber

3:27 a.m.

New subject: Where the hell is the donation form? (was Re: Outage)

Daniel Mayer wrote:

...

Argh - For the 20th time over the last 3 years: CAN SOMEBODY PLEASE PUT A DONATION FORM ON THE WIKIDOWN PAGE!!!

When I am finally successful at wheedling the secret method of updating the error pages from the unpaid volunteers who run the cache servers, we'll see what we can do.

...

Also, if we have a donation form, then these outages would be mostly self-correcting since the extra revenue will help us to add more redundancy to the system so we can at least go read only until we are completely back up.

It generally is considered to look very bad, though: "begging for money" or "holding the site ransom".

-- brion vibber (brion @ pobox.com)

wikitech-l@lists.wikimedia.org