On Sat, Nov 01, 2003 at 05:13:43PM +0100, Lars Aronsson wrote:
Jens Frank wrote:
I took the number "90%" as an example.
You can make it 99% if you like,
No, I don't like to "make it" any number at all. I prefer to base
my opinions on observations from real, running systems. In my
experience, load balancers (just like any Internet router) have near
0% downtime, at most one tenth of the downtime of the web server
applications involved, especially since the latter tend to be in
constant development.
Yes, I agree, I never saw one of our Cisco load balancers go down
due to hardware failure. What James proposed was using a Linux box. In
contrast to the Cisco, this one will not have a redundant power supply
and will have moving parts: fans and disk drives. And disks do fail.
The next point is detecting the failure of a node in the cluster.
Whether the Apache port is responding or not is easy to detect. But
is the information received correct? Is the server in sync? Is the
server able to connect to the database? Is the image directory up
to date? This is the area where I saw load balancers fail.
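
To make the difference between "port responds" and "node is actually
usable" concrete, here is a minimal sketch in Python. It is only an
illustration, not what any particular load balancer does: the names
run_checks and in_rotation are made up for this example, and the
database, sync and image directory checks are left as callables that
somebody would have to write for the real setup.

```python
import socket
from typing import Callable, Dict


def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """The 'easy' check: does a TCP connect to the web server succeed?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def run_checks(checks: Dict[str, Callable[[], bool]]) -> Dict[str, bool]:
    """Run every named check; a check that raises counts as failed."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False
    return results


def in_rotation(results: Dict[str, bool]) -> bool:
    """Keep a node in the pool only if every check passed.

    An empty result set is treated as unhealthy, so a node with no
    checks configured is never assumed to be fine.
    """
    return bool(results) and all(results.values())
```

A node would then be described by something like
{"port": lambda: port_open("web1", 80), "db": ..., "sync": ...}, where
"web1" and the db and sync callables are placeholders. The point is
only that the port check is one entry among several, and the node
leaves the pool as soon as any of them fails.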
If you can report differing experience, I would listen to your
arguments. But if all you can produce is various guesses in the
90-99 % range, this becomes pointless. Did you ever buy a Cisco
router that had 99% availability?
Yes. But Cisco gave us a new one to replace it. And we're not talking
about a Cisco load balancer, those are pretty expensive toys.
I just want to point out that availability will not increase.
I understand that this is what you want, but I still think you are
wrong.
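
The disagreement can be put into numbers. The following is a rough
sketch only, assuming that failures are independent and that the
balancer sits in series in front of web servers that work in parallel.
The function names and all input figures are illustrative, in the
spirit of the "90%" and "99%" examples above, not measurements.

```python
def parallel(availability: float, n: int) -> float:
    """Availability of n redundant nodes: down only if all n are down."""
    return 1.0 - (1.0 - availability) ** n


def series(*availabilities: float) -> float:
    """Availability of components that must all be up at the same time."""
    total = 1.0
    for a in availabilities:
        total *= a
    return total


def cluster(balancer: float, web: float, n: int) -> float:
    """One balancer in front of n identical web servers."""
    return series(balancer, parallel(web, n))


# Illustrative figures only: web servers at 90%, two of them.
single_server = 0.90
with_weak_balancer = cluster(balancer=0.90, web=0.90, n=2)
with_good_balancer = cluster(balancer=0.99, web=0.90, n=2)
```

With these made-up inputs, a balancer that is no more reliable than
the web servers leaves the total slightly below the single server,
while a 99% balancer lifts it to roughly 0.98. So the outcome depends
almost entirely on which balancer figure one believes, which is
exactly the point in dispute.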
Just to tease everybody, here is the corresponding table for
http://susning.nu/Sverige (a 44 kbyte page):
Week      Beginning    Downtime  Slowness  Avg access time
--------  -----------  --------  --------  ---------------
2003-w44  27 Oct 2003     1 %       1 %    0.66 seconds
2003-w43  20 Oct 2003     1 %       0 %    0.31
2003-w42  13 Oct 2003     1 %       2 %    0.73
2003-w41   6 Oct 2003     2 %       2 %    0.66
2003-w40  29 Sep 2003     0 %       1 %    0.32
2003-w39  22 Sep 2003     0 %       0 %    0.25
2003-w38  15 Sep 2003     0 %      15 %    2.24
2003-w37   8 Sep 2003     0 %      84 %    9.72 (oops!)
2003-w36   1 Sep 2003     0 %       2 %    0.50
2003-w35  25 Aug 2003     0 %       3 %    1.19
2003-w34  18 Aug 2003     0 %       2 %    0.60
2003-w33  11 Aug 2003     0 %       4 %    0.75
2003-w32   4 Aug 2003     0 %       1 %    0.26
2003-w31  28 Jul 2003     0 %       0 %    0.63
2003-w30  21 Jul 2003     0 %       5 %    0.92
Once again, these are my observations, not neutral facts. If you have
observations that differ significantly from these, please tell me.
OK, several possible conclusions:
- Susning might get fewer hits than Wikipedia
- Susning's software might be better than the current MediaWiki release
- Susning might be running on a hell of a machine
- ....
The only thing you prove by these figures is that Susning is faster.
And I agree that this has to change. Wikipedia should be as fast
as Susning. Clustering is the way to achieve this. But several
different ways to implement a cluster are available. The classical
ones are
* a load balancer, either special hardware or routing software running
  on a normal server, in front of the web servers, or
* cluster software like "heartbeat", "Sun Cluster" or "IBM HACMP"
  running on the web server nodes, taking over services when the other
  cluster partner dies.
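
The second variant can be sketched as follows. This is only the
general idea of a standby node taking over after missed heartbeats,
written in Python for illustration. It is not how heartbeat, Sun
Cluster or HACMP are actually implemented, and the class name, the
method names and the 10 second timeout are all assumptions made up
for this example.

```python
import time
from typing import Callable


class StandbyNode:
    """Standby partner that takes over once the peer has been silent too long."""

    def __init__(self, timeout: float = 10.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.timeout = timeout      # seconds of silence tolerated
        self.clock = clock          # injectable so the logic can be tested
        self.last_seen = clock()    # time of the last heartbeat from the peer
        self.active = False         # True once this node has taken over

    def heartbeat_received(self) -> None:
        """Called whenever a heartbeat from the partner arrives."""
        self.last_seen = self.clock()

    def tick(self) -> bool:
        """Periodic check; returns True if this node is (now) serving."""
        silent_for = self.clock() - self.last_seen
        if not self.active and silent_for > self.timeout:
            # Real cluster software would take over the service address
            # and start the services here; the sketch only flips a flag.
            self.active = True
        return self.active
```

Note that the sketch never hands the service back and does nothing
about the case where both nodes are alive but cannot see each other,
which is where real cluster software spends most of its effort.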
Regards,
JeLuF