On Sat, Nov 01, 2003 at 05:13:43PM +0100, Lars Aronsson wrote:
Jens Frank wrote:
I took the number "90%" as an example. You can make it 99% if you like,
No, I don't like to "make it" any number at all. I prefer to base my opinions on observations from real, running systems. In my experience, load balancers (just like any Internet router) have near 0% downtime, at most one tenth of that of the web server applications involved, especially since the latter tend to be in constant development.
Yes, I agree, I have never seen one of our Cisco load balancers go down due to hardware failure. What James proposed, though, was using a Linux box. In contrast to the Cisco, this one will not have a redundant power supply and will have moving parts: fans and disk drives. And disks do fail.
The next point is detecting the failure of a node in the cluster. Whether the Apache port responds is easy to detect. But is the information received correct? Is the server in sync? Is the server able to connect to the database? Is the image directory up to date? This is the area where I have seen load balancers fail.
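To illustrate the difference between "the port answers" and "the node is actually usable", here is a minimal sketch in Python. It is not taken from any real load balancer; the function names and the check names (database, in_sync, images_current) are made up for illustration, and the lambdas are stand-ins for whatever real probes a site would write.

```python
import socket
from typing import Callable, Dict, List, Tuple


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Shallow check: does anything accept TCP connections on host:port?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def node_is_healthy(checks: Dict[str, Callable[[], bool]]) -> Tuple[bool, List[str]]:
    """Run every named check and collect the names of those that failed.

    A check that raises an exception is counted as failed, so one broken
    probe cannot make a node look healthy.
    """
    failed: List[str] = []
    for name, check in checks.items():
        try:
            ok = bool(check())
        except Exception:
            ok = False
        if not ok:
            failed.append(name)
    return (not failed, failed)


# Hypothetical node: Apache answers, but the node is out of sync.
example_checks = {
    "apache_port": lambda: True,      # stand-in for port_is_open("web1", 80)
    "database": lambda: True,         # stand-in for a trivial test query
    "in_sync": lambda: False,         # stand-in for a replication check
    "images_current": lambda: True,   # stand-in for an image directory check
}
healthy, failed = node_is_healthy(example_checks)
```

A balancer that only looks at the port would keep sending traffic to this node; one that runs all four checks would take it out of rotation.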
If you can report differing experience, I would listen to your arguments. But if all you can produce is various guesses in the 90-99 % range, this becomes pointless. Did you ever buy a Cisco router that had 99% availability?
Yes. But Cisco gave us a new one to replace it. And we're not talking about a Cisco load balancer, those are pretty expensive toys.
I just want to point out that availability will not increase.
I understand that this is what you want, but I still think you are wrong.
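For what it is worth, both positions can be put into numbers. The following is a rough sketch in Python, assuming that all components fail independently of each other; the availability figures in it are invented for illustration and are not measurements from any system. The function names are my own.

```python
def series_availability(*parts: float) -> float:
    """Availability of components that must ALL be up (e.g. balancer + pool)."""
    result = 1.0
    for availability in parts:
        result *= availability
    return result


def parallel_availability(node: float, count: int) -> float:
    """Availability of `count` identical nodes where ONE being up is enough."""
    return 1.0 - (1.0 - node) ** count


# Illustrative figures only: each web server is up 99% of the time.
single_server = 0.99
pool = parallel_availability(0.99, 2)

# A balancer that is itself only 99% available in front of two servers.
with_weak_balancer = series_availability(0.99, pool)

# A balancer that is 99.9% available in front of the same two servers.
with_good_balancer = series_availability(0.999, pool)
```

Under these assumptions the setup with the weak balancer ends up slightly below the single server, and the one with the good balancer ends up above it. So the outcome depends entirely on how reliable the balancer really is, which is exactly what we disagree about.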
Just to tease everybody, here is the corresponding table for http://susning.nu/Sverige (a 44 kbyte page):
Week      Beginning    Downtime  Slowness  Avg access time
2003-w44  27 Oct 2003  1 %       1 %       0.66 seconds
2003-w43  20 Oct 2003  1 %       0 %       0.31
2003-w42  13 Oct 2003  1 %       2 %       0.73
2003-w41   6 Oct 2003  2 %       2 %       0.66
2003-w40  29 Sep 2003  0 %       1 %       0.32
2003-w39  22 Sep 2003  0 %       0 %       0.25
2003-w38  15 Sep 2003  0 %       15 %      2.24
2003-w37   8 Sep 2003  0 %       84 %      9.72 (oops!)
2003-w36   1 Sep 2003  0 %       2 %       0.50
2003-w35  25 Aug 2003  0 %       3 %       1.19
2003-w34  18 Aug 2003  0 %       2 %       0.60
2003-w33  11 Aug 2003  0 %       4 %       0.75
2003-w32   4 Aug 2003  0 %       1 %       0.26
2003-w31  28 Jul 2003  0 %       0 %       0.63
2003-w30  21 Jul 2003  0 %       5 %       0.92
Once again, these are my observations, not neutral facts. If you have observations that differ significantly from these, please tell me.
OK, several conclusions are possible:
- Susning might get fewer hits than Wikipedia
- Susning's software might be better than the current MediaWiki release
- Susning might be running on a hell of a machine
- ....
The only thing you prove by these figures is that Susning is faster. And I agree that this has to change. Wikipedia should be as fast as Susning. Clustering is the way to achieve this. But several different ways to implement a cluster are available. The classical ones are
* a load balancer, either special hardware or routing software running on a normal server, in front of the web servers, or
* cluster software like "heartbeat", "Sun Cluster" or "IBM HACMP" running on the web server nodes, taking over services when the other cluster partner dies.
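
The second approach boils down to each node watching for signs of life from its partner and taking over once they stop. Here is a very simplified sketch of that idea in Python. It is not how "heartbeat", "Sun Cluster" or "IBM HACMP" are actually implemented; the class name, the method names and the timeout value are assumptions made for the example.

```python
class PartnerMonitor:
    """Tracks when the cluster partner was last heard from.

    Timestamps are passed in explicitly (in seconds) so the logic can be
    followed without a real clock or network.
    """

    def __init__(self, timeout: float, now: float) -> None:
        self.timeout = timeout   # seconds of silence tolerated
        self.last_seen = now     # treat start-up as the first sign of life

    def record_heartbeat(self, now: float) -> None:
        """Call whenever a heartbeat message arrives from the partner."""
        self.last_seen = now

    def partner_is_dead(self, now: float) -> bool:
        """True once the partner has been silent for longer than the timeout."""
        return now - self.last_seen > self.timeout


# Hypothetical timeline: heartbeats at t=0 and t=5, then silence.
monitor = PartnerMonitor(timeout=10.0, now=0.0)
monitor.record_heartbeat(now=5.0)
alive_at_12 = not monitor.partner_is_dead(now=12.0)   # 7 s of silence
take_over_at_16 = monitor.partner_is_dead(now=16.0)   # 11 s of silence
```

Note that this only detects a partner that has gone completely quiet. A partner that still sends heartbeats but serves stale or wrong data is not caught, so the detection question raised above applies to this approach as well.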
Regards,
JeLuF