What are our plans for the network architecture as soon as the new server arrives (sometime next week)?
At that time, we hope that larousse and pliny will both be upgraded and equivalent to each other (I believe Jason has not yet gotten 4 gig of RAM to work, but 2 gig each and dual Athlon 2800+ should be doable).
The DB server will be the DB server, that much we know for sure. :-)
Beyond that, I think that the easiest thing to do would be to have en served by one machine, and everything else by the other machine. Based on total article count, which is roughly comparable for en vs rest-of-the-world, that seems good, but is it really? What about traffic?
In the longer term, the right way to do this is not to load balance by domain names, but to load balance properly.
I have had very good success in the past using iptables and a configuration that looks a lot like this picture:
http://www.ultramonkey.org/2.0.1/topologies/lb-eg.html
Of course, I did this years ago, and in the "poor man's" way -- I think there are probably packages (like ultramonkey!) that offer quick solutions now.
The beauty of this kind of architecture is:
1. high availability -- if one webserving node falls over, traffic automatically goes to the ones that are still up
2. easy expandability -- just add more webservers, at $2000 a crack for 'good enough' machines, install the software, and there you go.
Anyhow, to really do something like this, we'd need one more machine, but it need not be very powerful, since it's only going to be doing NAT/IPTABLES stuff.
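To make that concrete, here is a rough sketch of what the NAT director in that picture could look like. It uses LVS's ipvsadm (which is what ultramonkey wraps) rather than the raw iptables rules I hacked up back then, and every address below is just a placeholder:

    # Sketch only: an LVS/NAT director along the lines of the Ultra Monkey
    # topology above.  All addresses are placeholders, not our real ones.

    # Let the director forward packets between the public side and the webservers.
    echo 1 > /proc/sys/net/ipv4/ip_forward

    # Define a round-robin virtual HTTP service on the public address.
    ipvsadm -A -t 203.0.113.10:80 -s rr

    # Add the real webservers on the private network, reached via NAT (-m).
    ipvsadm -a -t 203.0.113.10:80 -r 192.168.1.11:80 -m
    ipvsadm -a -t 203.0.113.10:80 -r 192.168.1.12:80 -m

Packages like ultramonkey bundle this with ldirectord and heartbeat, so dead webservers get pulled out of the table automatically.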
--Jimbo
On Friday, Oct 31, 2003, at 09:18 US/Pacific, Jimmy Wales wrote:
Beyond that, I think that the easiest thing to do would be to have en served by one machine, and everything else by the other machine. Based on total article count, which is roughly comparable for en vs rest-of-the-world, that seems good, but is it really? What about traffic?
The European languages peak during the English wiki's trough, and they almost completely vanish during the English wiki's peaks:
http://en.wikipedia.org/stats/hourly_usage_200310.png
http://de.wikipedia.org/stats/hourly_usage_200310.png
While there is some overlap in the fairly heavy regions, there are times when one server is much more heavily loaded than the other. The peaks could be flattened out by spreading them with a round-robin arrangement.
In the longer term, the right way to do this is not to load balance by domain names, but to load balance properly.
I have had very good success in the past using iptables and a configuration that looks a lot like this picture:
Sounds workable.
-- brion vibber (brion @ pobox.com)
On Fri, Oct 31, 2003 at 09:18:43AM -0800, Jimmy Wales wrote:
I have had very good success in the past using iptables and a configuration that looks a lot like this picture:
http://www.ultramonkey.org/2.0.1/topologies/lb-eg.html
Of course, I did this years ago, and in the "poor man's" way -- I think there are probably packages (like ultramonkey!) that offer quick solutions now.
The beauty of this kind of architecture is:
- high availability -- if one webserving node falls over, traffic
automatically goes to the ones that are still up
Well, the architecture shown does not increase availability, since it adds a single point of failure. Let's assume a server's availability is 90% (it's higher, of course, but the numbers would become too ugly for this example). With one webserver and one database server, the overall system availability would be 81%.
Clustering two web servers will increase the web servers' availability to 99%. But now the system has three components: load balancer (90%), web servers (99%), database (90%). That's a total availability of 80.19%. Oops.
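Spelling out the arithmetic behind those figures, assuming failures are independent (series components multiply; redundant components combine through their unavailabilities):

    $A_{\mathrm{web}} \cdot A_{\mathrm{db}} = 0.9 \times 0.9 = 0.81$
    $A_{\mathrm{web\,pair}} = 1 - (1 - 0.9)^2 = 0.99$
    $A_{\mathrm{system}} = A_{\mathrm{lb}} \cdot A_{\mathrm{web\,pair}} \cdot A_{\mathrm{db}} = 0.9 \times 0.99 \times 0.9 = 0.8019$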
If we really needed to increase hardware availability, we would have to use either a cluster of load balancers or a cluster of webservers with round-robin DNS and IP takeover.
- easy expandability -- just add more webservers, at $2000 a crack
for 'good enough' machines, install the software, and there you go.
That's true, if the TeX output and the image directory can be shared somehow.
Regards,
JeLuF
Jens Frank wrote:
Clustering two web servers will increase the web servers' availability to 99%. But now the system has three components: load balancer (90%), web servers (99%), database (90%). That's a total availability of 80.19%.
Jens, have you ever seen a real load balancer that had as little as 90% availability, or are you just dreaming up some numbers that will prove your prejudice?
Here are my recent observations for www.wikipedia.org/wiki/Sweden (slowness means the response time was 5-60 seconds; downtime means no answer at all within 60 seconds; the page is circa 50 kbytes):
Week      Beginning    Downtime  Slowness  Avg access time
--------  -----------  --------  --------  ---------------
2003-w44  27 Oct 2003     1 %      54 %      8.51 seconds
2003-w43  20 Oct 2003     1 %      39 %      6.75
2003-w42  13 Oct 2003     2 %      11 %      3.66
2003-w41   6 Oct 2003     2 %      22 %      3.42
2003-w40  29 Sep 2003     6 %      29 %      4.84
2003-w39  22 Sep 2003     5 %      26 %      4.06
2003-w38  15 Sep 2003     1 %      25 %      3.66
2003-w37   8 Sep 2003     5 %      36 %      4.90
2003-w36   1 Sep 2003     2 %      10 %      2.26
2003-w35  25 Aug 2003     0 %       1 %      1.42
2003-w34  18 Aug 2003     0 %       2 %      1.48
2003-w33  11 Aug 2003     0 %      12 %      2.41
2003-w32   4 Aug 2003     0 %       6 %      2.06
2003-w31  28 Jul 2003     1 %       2 %      1.52
2003-w30  21 Jul 2003    26 %       1 %      1.90
I think these numbers indicate that slowness (performance) is the problem, not low availability. The server is capable of delivering a response in 1.5 seconds, but in the last week prefers to add seven seconds of dead weight (dead wait, uhuh) to this.
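For what it's worth, a probe of this kind does not need to be fancy. The sketch below is hypothetical, not necessarily the script behind the table above, but it applies the same thresholds (no answer within 60 seconds counts as down, more than 5 seconds counts as slow):

    #!/bin/sh
    # Hypothetical probe: fetch the page once, time it, classify the result.
    URL="http://www.wikipedia.org/wiki/Sweden"
    t=$(curl -o /dev/null -s -m 60 -w '%{time_total}' "$URL") || t=""
    stamp=$(date -u '+%Y-%m-%d %H:%M:%S')
    if [ -z "$t" ]; then
        echo "$stamp DOWN"
    elif [ "$(echo "$t > 5" | bc)" -eq 1 ]; then
        echo "$stamp SLOW $t"
    else
        echo "$stamp OK   $t"
    fi

Run something like that from cron and the weekly percentages fall out of the log.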
If it were up to me, we would build a time machine that takes us back to August. I guess that is a vacation month with less traffic. Or is there another explanation?
On Sat, Nov 01, 2003 at 03:56:37PM +0100, Lars Aronsson wrote:
Jens Frank wrote:
Clustering two web servers will increase the web servers' availability to 99%. But now the system has three components: load balancer (90%), web servers (99%), database (90%). That's a total availability of 80.19%.
Jens, have you ever seen a real load balancer that had as little as 90% availability, or are you just dreaming up some numbers that will prove your prejudice?
Please try reading the entire mail. It says:
(it's higher, of course, but the numbers would become too ugly for this example)
I took the number "90%" as an example. You can make it 99% if you like; the result remains the same: one load balancer in front of two webservers reduces availability. Simple maths.
I also think that load balancing a farm of web servers is the way to go; I just want to point out that availability will not increase.
Regards,
JeLuF
I took the number "90%" as an example. You can make it 99% if you like; the result remains the same: one load balancer in front of two webservers reduces availability. Simple maths.
I also think that load balancing a farm of web servers is the way to go; I just want to point out that availability will not increase.
This is not only wrong, but silly at the same time. Load balancers have better availability than web servers - if they didn't, nobody would even bother with them. They'd just have crazy schemes where web servers automatically take over for each other when others go down.
Plus, consider that when you add a load balancer in front of a farm of web servers, you no longer care about the availability of the web servers - because when a web server goes down, the load balancer takes care of it. So, by your 90% rule of availability, given one load balancer and four web servers:
10% (load balancer downtime) + (.1 * .1 * .1 * .1 = .0001, i.e. 0.01%) = roughly 10.01% downtime. Not noticeable compared to the plain 10% of having just a web server. When you take into account the extra speed you get with the load balancer, plus the fact that load balancers are more reliable than web servers, it's a no-brainer.
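Written out with the load balancer's availability left as a parameter (which is the part we actually disagree about), and assuming independent failures:

    $P(\mathrm{down}) = 1 - A_{\mathrm{lb}} \cdot (1 - (1 - 0.9)^4)$

With $A_{\mathrm{lb}} = 0.9$ that comes to roughly 10.01% downtime; with, say, $A_{\mathrm{lb}} = 0.999$ it drops to about 0.11%.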
On Sat, Nov 01, 2003 at 10:07:25AM -0600, Nick Reinking wrote:
I took the number "90%" as an example. You can make it 99% if you like; the result remains the same: one load balancer in front of two webservers reduces availability. Simple maths.
I also think that load balancing a farm of web servers is the way to go; I just want to point out that availability will not increase.
This is not only wrong, but silly at the same time. Load balancers have better availability than web servers - if they didn't, nobody would even bother with them. They'd just have crazy schemes where web servers automatically take over for each other when others go down.
Plus, consider that when you add a load balancer in front of a farm of web servers, you no longer care about the availability of the web servers - because when a web server goes down, the load balancer takes care of it. So, by your 90% rule of availability, given one load balancer and four web servers:
10% (load balancer downtime) + (.1 * .1 * .1 * .1 = .0001, i.e. 0.01%) = roughly 10.01% downtime. Not noticeable compared to the plain 10% of having just a web server. When you take into account the extra speed you get with the load balancer, plus the fact that load balancers are more reliable than web servers, it's a no-brainer.
You got my point! Extra speed: yes. Extra reliability: not really. Not if you add a single, non-clustered load balancer.
And to strengthen it: extra speed is what we need; here I agree completely with Lars.
Regards,
JeLuF
10% (load balancer downtime) + (.1 * .1 * .1 * .1 = .0001, i.e. 0.01%) = roughly 10.01% downtime. Not noticeable compared to the plain 10% of having just a web server. When you take into account the extra speed you get with the load balancer, plus the fact that load balancers are more reliable than web servers, it's a no-brainer.
You got my point! Extra speed: yes. Extra reliability: not really. Not if you add a single, non-clustered load balancer.
And to strengthen it: extra speed is what we need; here I agree completely with Lars.
Well, I got your point, but you didn't get mine. What I say above is that reliability with load balancers would be worse (on average) if, and only if, load balancers had the same reliability as web servers. They don't -- they're more reliable than web servers, and thus our reliability should go up.
Jens Frank wrote:
I took the number "90%" as an example. You can make it 99% if you like,
No, I don't like to "make it" any number at all. I prefer to base my opinions on observations from real, running systems. In my experience, load balancers (just like any Internet router) have near 0% downtime, at most one tenth that of the web server applications involved, especially since the latter tend to be in constant development. If you can report differing experience, I would listen to your arguments. But if all you can produce is various guesses in the 90-99% range, this becomes pointless. Did you ever buy a Cisco router that had 99% availability?
I just want to point out that availability will not increase.
I understand that this is what you want, but I still think you are wrong.
Just to tease everybody, here is the corresponding table for http://susning.nu/Sverige (a 44 kbyte page):
Week      Beginning    Downtime  Slowness  Avg access time
--------  -----------  --------  --------  ---------------
2003-w44  27 Oct 2003     1 %       1 %      0.66 seconds
2003-w43  20 Oct 2003     1 %       0 %      0.31
2003-w42  13 Oct 2003     1 %       2 %      0.73
2003-w41   6 Oct 2003     2 %       2 %      0.66
2003-w40  29 Sep 2003     0 %       1 %      0.32
2003-w39  22 Sep 2003     0 %       0 %      0.25
2003-w38  15 Sep 2003     0 %      15 %      2.24
2003-w37   8 Sep 2003     0 %      84 %      9.72 (oops!)
2003-w36   1 Sep 2003     0 %       2 %      0.50
2003-w35  25 Aug 2003     0 %       3 %      1.19
2003-w34  18 Aug 2003     0 %       2 %      0.60
2003-w33  11 Aug 2003     0 %       4 %      0.75
2003-w32   4 Aug 2003     0 %       1 %      0.26
2003-w31  28 Jul 2003     0 %       0 %      0.63
2003-w30  21 Jul 2003     0 %       5 %      0.92
Once again, these are my observations, not neutral facts. If you have observations that differ significantly from these, please tell me.
On Sat, Nov 01, 2003 at 05:13:43PM +0100, Lars Aronsson wrote:
Jens Frank wrote:
I took the number "90%" as an example. You can make it 99% if you like,
No, I don't like to "make it" any number at all. I prefer to base my opinions on observations from real, running systems. In my experience, load balancers (just like any Internet router) have near 0% downtime, at most one tenth that of the web server applications involved, especially since the latter tend to be in constant development.
Yes, I agree: I never saw one of our Cisco load balancers go down due to hardware failure. What James proposed was using a Linux box. In contrast to the Cisco, this one will not have a redundant power supply and will have moving parts: fans and disk drives. And disks do fail.
The next point is detecting the failure of a node in the cluster. Whether the Apache port is responding or not is easy to detect. But is the information received correct? Is the server in sync? Is the server able to connect to the database? Is the image directory up to date? This is the area where I have seen load balancers fail.
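To make that concrete, a director-side check that goes a bit further than "port 80 answers" might look roughly like the sketch below. This is purely illustrative; the node name, the test page, the test image and the status page are all made up:

    #!/bin/sh
    # Hypothetical per-node health check run from the load balancer.
    NODE="$1"

    # Apache answers *and* returns a real article, not an error page.
    curl -s -m 10 "http://$NODE/wiki/Main_Page" | grep -q 'Main Page' || exit 1

    # The node serves a known file out of the image directory
    # (catches a stale or unmounted upload area).
    curl -s -f -m 10 -o /dev/null "http://$NODE/upload/health-check.png" || exit 1

    # Whether the node can reach the database is best answered by the node
    # itself, e.g. a tiny status page that runs one SELECT and prints "OK".
    curl -s -m 10 "http://$NODE/status.php" | grep -q '^OK' || exit 1

    exit 0

Even then, "is the server in sync" is the hard part -- a check like this only tells you the node is alive, not that it agrees with its peers.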
If you can report differing experience, I would listen to your arguments. But if all you can produce is various guesses in the 90-99% range, this becomes pointless. Did you ever buy a Cisco router that had 99% availability?
Yes. But Cisco gave us a new one to replace it. And we're not talking about a Cisco load balancer; those are pretty expensive toys.
I just want to point out that availability will not increase.
I understand that this is what you want, but I still think you are wrong.
Just to tease everybody, here is the corresponding table for http://susning.nu/Sverige (a 44 kbyte page):
Week      Beginning    Downtime  Slowness  Avg access time
--------  -----------  --------  --------  ---------------
2003-w44  27 Oct 2003     1 %       1 %      0.66 seconds
2003-w43  20 Oct 2003     1 %       0 %      0.31
2003-w42  13 Oct 2003     1 %       2 %      0.73
2003-w41   6 Oct 2003     2 %       2 %      0.66
2003-w40  29 Sep 2003     0 %       1 %      0.32
2003-w39  22 Sep 2003     0 %       0 %      0.25
2003-w38  15 Sep 2003     0 %      15 %      2.24
2003-w37   8 Sep 2003     0 %      84 %      9.72 (oops!)
2003-w36   1 Sep 2003     0 %       2 %      0.50
2003-w35  25 Aug 2003     0 %       3 %      1.19
2003-w34  18 Aug 2003     0 %       2 %      0.60
2003-w33  11 Aug 2003     0 %       4 %      0.75
2003-w32   4 Aug 2003     0 %       1 %      0.26
2003-w31  28 Jul 2003     0 %       0 %      0.63
2003-w30  21 Jul 2003     0 %       5 %      0.92
Once again, these are my observations, not neutral facts. If you have observations that differ significantly from these, please tell me.
OK, several possible conclusions:
- Susning might have fewer hits than Wikipedia
- Susning's software might be better than the current MediaWiki release
- Susning might be running on a hell of a machine
- ...
The only thing you prove by these figures is that Susning is faster. And I agree that this has to change: Wikipedia should be as fast as Susning. Clustering is the way to achieve this. But several different ways to implement a cluster are available. The classical ones are:
* a load balancer, either special hardware or routing software running on a normal server, in front of the web servers, or
* cluster software like "heartbeat", "Sun Cluster" or "IBM HACMP" running on the web server nodes, taking over services when the other cluster partner dies.
Regards,
JeLuF
Jens Frank wrote:
OK, several possible conclusions:
- Susning might have fewer hits than Wikipedia
- Susning's software might be better than the current MediaWiki release
- Susning might be running on a hell of a machine
- ....
Susning is in Swedish, limiting the possible audience to 9 million people plus Googlebot, so it should have less traffic than the English Wikipedia. Still, it pumped out 50 GB of traffic (8 million hits) in October, comparable to the German Wikipedia. Wikipedia had performance problems long before it had this much traffic.
The only thing you prove by these figures is that Susning is faster.
Yes, because this is the *only* thing that counts. Fix Wikipedia's response times and nobody will care about the architecture. I think you all focus too much on architecture and too little on time.
The focus on wallclock time is what drives my development of susning.
Think of wallclock time as a budget. You're spending 8.50 seconds on each request, and this is too much. Apparently it was possible in August to serve requests while spending only 1.5 seconds on each.
What would Alan Greenspan do? He would ask you to break down the sum. Where are the first 0.5 seconds spent? The next 0.5 seconds? The next? Alan Greenspan would not fire you for spending 8.50 seconds, but he would fire you if you could not say how they are spent, since that means you are out of control.
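A first, crude cut at that breakdown can come straight from the client side; curl can at least separate the network phases from the time the server spends thinking (the URL is the same page I measure above):

    # Per-phase timing of one request.  Everything between "connect" and
    # "first byte" is essentially server-side work that needs further breakdown.
    curl -o /dev/null -s -w 'dns:%{time_namelookup} connect:%{time_connect} first byte:%{time_starttransfer} total:%{time_total}\n' \
        http://www.wikipedia.org/wiki/Sweden

The rest of the budget has to be broken down inside the wiki software itself, for example by timestamping the start and end of the database queries and the page rendering.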
Jens Frank wrote:
I took the number "90%" as an example. You can make it 99% if you like; the result remains the same: one load balancer in front of two webservers reduces availability. Simple maths.
Simple but wrong maths, though.
The key is that the "availability" we're talking about is a function of what a given machine is doing. Individual webservers are likely to become unusable for many reasons -- Apache problems, unbalanced traffic, etc. But the load balancer is nearly bulletproof -- and it shields the user from failures of individual webservers in the cluster!
--Jimbo
Jens Frank wrote:
Well, the architecture shown does not increase availability, since it adds a single point of failure. Let's assume a server's availability is 90% (it's higher, of course, but the numbers would become too ugly for this example). With one webserver and one database server, the overall system availability would be 81%.
Clustering two web servers will increase the web servers' availability to 99%. But now the system has three components: load balancer (90%), web servers (99%), database (90%). That's a total availability of 80.19%. Oops.
Your assumptions are wrong. The probability of a server falling over in some fashion depends on what's going on in the server. For a busy webserver, falling over is much more likely than for the load balancer.
In my experience, a load balancer setup with good (I mean reliable, speed isn't an issue) hardware doing *nothing* but NAT via iptables will have a huge uptime.
--Jimbo