-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1
hi,
due to a failed linecard in the switch at knams, yarrow.toolserver.org, the database replica for s1 (enwiki) and s3, is currently offline. this means s1 and s3 clusters are inaccessible from all servers (including stable). hopefully someone will be able to visit the colo tomorrow and connect yarrow to a different card until the failed module is replaced.
- river.
I thought the hardware failure was on a wikimedia server.
On Sun, Sep 7, 2008 at 2:42 AM, River Tarnell river@wikimedia.org wrote:
hi,
due to a failed linecard in the switch at knams, yarrow.toolserver.org, the database replica for s1 (enwiki) and s3, is currently offline. this means s1 and s3 clusters are inaccessible from all servers (including stable). hopefully someone will be able to visit the colo tomorrow and connect yarrow to a different card until the failed module is replaced.
- river.
Toolserver-announce mailing list Toolserver-announce@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/toolserver-announce
-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1
White Cat:
I thought the hardware failure was on a wikimedia server.
no, as i explained, the hardware failure was on the switch. both Wikimedia and Toolserver systems are connected to this switch.
- river.
What happened to redundancy?
Stwalkerster
2008/9/7 River Tarnell river@wikimedia.org
White Cat:
I thought the hardware failure was on a wikimedia server.
no, as i explained, the hardware failure was on the switch. both Wikimedia and Toolserver systems are connected to this switch.
- river.
Toolserver-l mailing list Toolserver-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/toolserver-l
-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1
Simon Walker:
What happened to redundancy?
i'm not sure what you're asking here; there has never been any redundancy for host network connections, or for toolserver database replicas, so nothing happened to that - it wasn't there in the first place.
if you're suggesting redundancy should be added - yes, that would be nice, except that to duplicate 3 database servers would cost around EUR20,000, and even that wouldn't help if the network problem was elsewhere (for example, if our transit connections had been on the card that failed, everything would have been offline, not just one server).
perhaps we could duplicate the entire toolserver setup at another location. that would cost around EUR30,000 in hardware, as well as monthly rental and transit (for at least 16RU space, which wouldn't be cheap, and a fair amount of bandwidth), and would significantly increase the administration effort needed, when we hardly have enough time to maintain the current set of servers. and what would it save - a few hours downtime in the rather unlikely event of a linecard failure?
- river.
2008/9/7 River Tarnell river@wikimedia.org
i'm not sure what you're asking here; there has never been any redundancy for host network connections, or for toolserver database replicas, so nothing happened to that - it wasn't there in the first place.
if you're suggesting redundancy should be added - yes, that would be nice, except that to duplicate 3 database servers would cost around EUR20,000, and even that wouldn't help if the network problem was elsewhere (for example, if our transit connections had been on the card that failed, everything would have been offline, not just one server).
perhaps we could duplicate the entire toolserver setup at another location. that would cost around EUR30,000 in hardware, as well as monthly rental and transit (for at least 16RU space, which wouldn't be cheap, and a fair amount of bandwidth), and would significantly increase the administration effort needed, when we hardly have enough time to maintain the current set of servers. and what would it save - a few hours downtime in the rather unlikely event of a linecard failure?
I was actually talking about connecting the servers to the net down different routes, ie multiple linecards, so if one went down, the other would be there to take the load. Replicating the entire toolserver would be way too much bother and expense, I agree, but surely another connection can't hurt too much?
(apologies if I've got completely the wrong end of the stick here)
Stwalkerster
On Sun, Sep 7, 2008 at 5:08 PM, Simon Walker stwalkerster@googlemail.com wrote:
I was actually talking about connecting the servers to the net down different routes, ie multiple linecards, so if one went down, the other would be there to take the load. Replicating the entire toolserver would be way too much bother and expense, I agree, but surely another connection can't hurt too much?
(apologies if I've got completely the wrong end of the stick here)
Doubling the cost of switch ports (this isn't a $50 netgear we're talking about here…) to guard a less-essential service against a few hours' outage from a failure mode that has only happened once in many years?
Doesn't sound like much of a win.
-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1
Simon Walker:
I was actually talking about connecting the servers to the net down different routes, ie multiple linecards
an additional linecard is still several k$, money which could be better spent on more servers, i think.
over the next few months some changes will be made to the network setup at knams, which might make it easier to provide some additional redundancy without spending so much (because we will no longer connect the entire toolserver to the main Wikimedia switch).
- river.
an additional linecard is still several k$, money which could be better spent on more servers, i think.
Fair enough, I didn't think it would be that expensive, $100 at most.
over the next few months some changes will be made to the network setup at knams, which might make it easier to provide some additional redundancy without spending so much (because we will no longer connect the entire toolserver to the main Wikimedia switch).
Sounds good. Thanks for the clarification
Simon
-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1
River Tarnell:
due to a failed linecard in the switch at knams, yarrow.toolserver.org, the database replica for s1 (enwiki) and s3, is currently offline.
as people probably noticed, this problem is now resolved (by moving yarrow to a different linecard).
- river.