Hello,
I spent the last few hours trying to dig into the infrastructural organization of the Wikimedia servers. My starting points were [[meta:Wikimedia_servers]] and Ganglia, and my motivation was Wikipedia's slowness lately.
Contrary to my expectations, the database servers are far from being under high load. The pressure even seems so low that you can easily live without holbach and webster for days (or, respectively, over a month). The bottlenecks are the Apaches and Squids (yes, I know that's nothing new to you).
But as with all the other clusters, the load is distributed very unequally across the machines. For example, the Yahoo! Squids showed yf1003 9.39, yf1000 7.60, yf1004 1.60, yf1002 1.44, yf1001 0.73 at noon (UTC) today, and similar load values (albeit with a different distribution) at other times.
Or the Apaches in Florida: 16 Apaches with a load around 15, 9 between 1.5 and 2, 8 between 1 and 1.5, and 10 below 1.
Where does this come from, or is it intended? Wouldn't a more balanced load be better?
Another point: the Yahoo! Squids do virtually nothing between 18:00 and 0:00 (and the machines other than yf1000-yf1004 do virtually nothing around the clock). How nice it would be to have them help out the overloaded machines in Florida and the Netherlands, at least during those six hours.
And no, I'm not criticizing anyone or claiming to know how to do it better. But the available information looks strange to me - it would be great to get some explanations.
Speaking of explanations, I have three more simple questions:
1. The Squids at lopar have been idle all the time since DNS was moved off them. What were the problems with them, and will they be back soon?
2. Commons has been very slow since the move from the prior "overloaded" server to the new one. Any explanation to satisfy a simple user? And which server is the new one?
3. I read about new machines srv51-70. Where do they come from? I can't see a recent order for them, nor are they mentioned on [[meta:Wikimedia_servers]].
Thank you in advance, Juergen
Jürgen Herz wrote:
But as with all the other clusters, the load is distributed very unequally across the machines.
[snip]
Where does this come from, or is it intended? Wouldn't a more balanced load be better?
More balanced would likely be better. Spreading load evenly seems to be really hard to get right; if you have advice based on experience I'm sure we'd love to hear it.
Another point: the Yahoo! Squids do virtually nothing between 18:00 and 0:00 (and the machines other than yf1000-yf1004 do virtually nothing around the clock). How nice it would be to have them help out the overloaded machines in Florida and the Netherlands, at least during those six hours.
What would they do during this time?
- The Squids at lopar have been idle all the time since DNS was moved off them. What were the problems with them, and will they be back soon?
They're older, slower machines and can handle only a small fraction of what we pump through the Amsterdam cluster, so not too sure about these.
- Commons has been very slow since the move from the prior "overloaded" server to the new one. Any explanation to satisfy a simple user? And which server is the new one?
It was already slow *before* that. The files were moved to a faster internal server, but the web interface is still on a slow machine until things get finalized. This means images are still slow to load (alas) but don't bog down the primary wiki web servers as much when they poke at the images.
-- brion vibber (brion @ pobox.com)
Jürgen Herz wrote:
Hello,
I spent the last few hours trying to dig into the infrastructural organization of the Wikimedia servers. My starting points were [[meta:Wikimedia_servers]] and Ganglia, and my motivation was Wikipedia's slowness lately.
Contrary to my expectations, the database servers are far from being under high load. The pressure even seems so low that you can easily live without holbach and webster for days (or, respectively, over a month). The bottlenecks are the Apaches and Squids (yes, I know that's nothing new to you).
But as with all the other clusters, the load is distributed very unequally across the machines. For example, the Yahoo! Squids showed yf1003 9.39, yf1000 7.60, yf1004 1.60, yf1002 1.44, yf1001 0.73 at noon (UTC) today, and similar load values (albeit with a different distribution) at other times.
That's just ordinary random variation. The 15 minute load average is much more closely clustered than the 1 minute load average.
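(As a side note on why the 15 minute figure is smoother: the 1/5/15 minute load averages are exponentially damped moving averages of the run queue length, sampled every few seconds, so the longer window simply damps the noise more. Below is a minimal Python sketch with a made-up random run queue, not real measurements.)

import math
import random

SAMPLE_INTERVAL = 5.0                      # seconds between samples, roughly what Linux uses

def decay(window_minutes):
    # damping factor for an exponentially weighted average over the given window
    return math.exp(-SAMPLE_INTERVAL / (window_minutes * 60.0))

e1, e15 = decay(1), decay(15)
load1 = load15 = 0.0
samples = []
for _ in range(5000):
    active = random.randint(0, 12)         # made-up run queue length
    load1 = load1 * e1 + active * (1 - e1)
    load15 = load15 * e15 + active * (1 - e15)
    samples.append((load1, load15))

steady = samples[1000:]                    # skip the warm-up period
spread1 = max(s[0] for s in steady) - min(s[0] for s in steady)
spread15 = max(s[1] for s in steady) - min(s[1] for s in steady)
print(round(spread1, 2), round(spread15, 2))   # the 15 minute average varies far less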
Or the Apaches in Florida: 16 Apaches with a load around 15, 9 between 1.5 and 2, 8 between 1 and 1.5, and 10 below 1.
Where does this come from, or is it intended? Wouldn't a more balanced load be better?
The apache load figures are unreliable at the moment because there are a number of hung processes on each machine waiting for an NFS share that is never going to start working again. But there are still a few points I can make about this:
* The apache load also sees quite a lot of random variation from minute to minute. This is unavoidable and, in my opinion, harmless. We can't tell how much CPU time a request will require before we let apache accept the connection; that's why we have multiprocessing.
* Perlbal suffered from a couple of load balancing problems, such as oscillation of load between the perlbal servers. When we switched to LVS, with weighted least connection scheduling, load distribution became much more stable on the 10 minute time scale (see the sketch below this list).
* Currently we're not using a higher weight for the dual-CPU apaches than for the single-CPU apaches. It's likely that the optimal concurrency level for dual-CPU machines is higher than for single-CPU machines. I'm not sure what impact this has on throughput, but I suspect it would be fairly small, especially in times of high load. To have high throughput, you just have to have *enough* connections queued or active to keep the CPU busy, and at high load that condition is likely to be satisfied.
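For anyone unfamiliar with the scheduler named above: weighted least-connection simply hands each new request to the server with the lowest ratio of active connections to weight, so a heavier (for example dual-CPU) box is allowed proportionally more simultaneous work. Here is a toy Python sketch; the pool, names and weights are invented for illustration and are not our actual LVS configuration.

# invented pool: a dual-CPU box gets twice the weight of a single-CPU box
pool = {
    "apache-a": {"weight": 1, "active": 0},
    "apache-b": {"weight": 2, "active": 0},
}

def pick_server():
    # weighted least-connection: smallest active/weight ratio wins
    return min(pool, key=lambda name: pool[name]["active"] / pool[name]["weight"])

def dispatch():
    name = pick_server()
    pool[name]["active"] += 1      # a real balancer decrements this when the request finishes
    return name

for _ in range(6):
    print(dispatch())              # apache-b ends up chosen twice as often as apache-a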
The crucial thing to avoid in load balancing is a wide divergence in request service time, wide enough to be user-visible. At the moment, we don't seem to have that, except in a few special cases. A couple of hundred milliseconds of divergence is acceptable; it's when you get into the seconds that you have a problem.
This is data from the last 24 hours or so:
mysql> select pf_server,pf_time/pf_count from profiling where pf_name='-total';
+-----------+------------------+
| pf_server | pf_time/pf_count |
+-----------+------------------+
| srv15     |  204.56651921613 |
| goeje     |  201.84655049787 |
| srv16     |  153.09295690508 |
| srv49     |  103.67533112583 |
| srv12     |  142.05136531207 |
| srv45     |  171.57344543147 |
| srv29     |  137.25579172808 |
| srv39     |   169.4883604123 |
| srv20     |     149.64453125 |
| srv18     |  159.00027606007 |
| srv41     |  143.65079066265 |
| srv22     |  159.61156337535 |
| srv25     |  141.55984462781 |
| srv21     |  142.70388359744 |
| diderot   |  324.14045643154 |
| rabanus   |   208.7263215859 |
| humboldt  |  834.82322485207 |
| srv43     |  129.96146801287 |
| avicenna  |  173.68361557263 |
| srv27     |  428.79877888655 |
| srv19     |  229.72577751196 |
| srv28     |  148.12455417799 |
| srv35     |  125.47606219212 |
| srv48     |  142.57025343643 |
| srv37     |  147.24074616948 |
| srv32     |  145.71093201754 |
| srv46     |  161.01125266335 |
| srv24     |  175.73745958135 |
| srv44     |  173.76259793447 |
| srv14     |  143.50663395904 |
| srv11     |  146.01194914544 |
| srv30     |  162.15520549113 |
| srv23     |  169.26885457677 |
| srv0      |  269.70157858274 |
| srv38     |  182.01380813953 |
| srv17     |  143.32407831225 |
| srv34     |  120.18475895904 |
| srv13     |   165.9256635274 |
| alrazi    |  203.50503060089 |
| srv50     |  152.40960211391 |
| srv36     |  210.69465332031 |
| hypatia   |  169.72124821556 |
| srv47     |  161.01089421174 |
| friedrich |  180.30277434721 |
| srv40     |  168.59215472028 |
| srv3      |  145.29130517183 |
| srv33     |  119.75965134641 |
| srv4      |  149.60560154037 |
| kluge     |  173.34889083873 |
| rose      |  675.57995605469 |
+-----------+------------------+
50 rows in set (0.01 sec)
Humboldt and rose need attention, but I wouldn't worry about the rest.
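If you want to pick out the stragglers mechanically rather than by eye, something like the following would do it. This is only a sketch: the dictionary holds just a handful of the figures from the table above, and the cutoff of three times the median is arbitrary.

# a few of the per-server means (in ms) from the profiling table above
service_ms = {
    "srv15": 204.6, "goeje": 201.8, "srv16": 153.1, "srv49": 103.7,
    "srv12": 142.1, "diderot": 324.1, "srv27": 428.8, "srv33": 119.8,
    "humboldt": 834.8, "rose": 675.6,
}

values = sorted(service_ms.values())
median = values[len(values) // 2]          # crude median, good enough for a quick check
slow = {s: t for s, t in service_ms.items() if t > 3 * median}
print(slow)                                # flags humboldt and rose, matching the comment above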
Another point: the Yahoo! Squids do virtually nothing between 18:00 and 0:00 (and the machines other than yf1000-yf1004 do virtually nothing around the clock). How nice it would be to have them help out the overloaded machines in Florida and the Netherlands, at least during those six hours.
This was an intentional part of our architecture. Local squid clusters serve local users, this reduces latency due to network round-trip time (RTT). Of course, this only makes sense if that network RTT (200ms or so) is greater than the local service time due to high load. Hopefully we can fix our squid software problems and add more squid hardware to Florida (which is the only site where it truly seems to be lacking), and thus maintain this design. Spreading load across servers distributed worldwide is cheaper but necessarily slower than the alternative of minimising distance between cache and user.
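To make the trade-off concrete, here is a rough Python sketch with invented numbers (not measurements from our clusters): a user is routed to the nearest squid cluster only as long as that cluster's service time stays below the round-trip time it saves.

# invented figures for a hypothetical user in Korea
clusters = {
    "yaseo": {"rtt_ms": 30, "service_ms": 80},     # nearby cache
    "pmtpa": {"rtt_ms": 250, "service_ms": 40},    # far away, but lightly loaded
}

def expected_latency(name):
    # one round trip to reach the cache plus the time the cache takes to answer
    return clusters[name]["rtt_ms"] + clusters[name]["service_ms"]

best = min(clusters, key=expected_latency)         # yaseo: 110 ms versus 290 ms for pmtpa
# Once yaseo's service time rose above about 260 ms, the distant cluster would win,
# which is exactly the point at which serving local users locally stops paying off.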
It might interest you to know that as soon as I came to terms with the fact that the knams cluster would be idle at night, I promoted the idea of using that idle CPU time for something useful, like non-profit medical research. We've now clocked up 966 work units for Folding@Home:
http://fah-web.stanford.edu/cgi-bin/main.py?qtype=userpage&username=wiki...
And no, I'm not criticizing anyone or claiming to know how to do it better. But the available information looks strange to me - it would be great to get some explanations.
Speaking of explanations, I have three more simple questions:
- The Squids at lopar have been idle all the time since DNS was moved off them. What were the problems with them, and will they be back soon?
- Commons has been very slow since the move from the prior "overloaded" server to the new one. Any explanation to satisfy a simple user? And which server is the new one?
Brion answered those two well enough.
- I read about new machines srv51-70. Where do they come from? I can't see a recent order for them, nor are they mentioned on [[meta:Wikimedia_servers]].
No idea. I just use the things.
-- Tim Starling
Tim Starling wrote: <snip>
The apache load figures are unreliable at the moment because there are a number of hung processes on each machine waiting for an NFS share that is never going to start working again.
<snip>
If any root is interested in fixing those apaches, I listed them in a bug report on bugzilla:
http://bugzilla.wikimedia.org/show_bug.cgi?id=3869
Ashar Voultoiz wrote:
If any root is interested in fixing those apaches, I listed them in a bug report on bugzilla:
http://bugzilla.wikimedia.org/show_bug.cgi?id=3869
Hello,
But as with all the other clusters, the load is distributed very unequally across the machines. For example, the Yahoo! Squids showed yf1003 9.39, yf1000 7.60, yf1004 1.60, yf1002 1.44, yf1001 0.73 at noon (UTC) today, and similar load values (albeit with a different distribution) at other times.
That's just ordinary random variation. The 15 minute load average is much more closely clustered than the 1 minute load average.
Oh, indeed. Without switching to load_fifteen, it's hard to see that in the cluster overview.
Or the Apaches in Florida: 16 Apaches with a load around 15, 9 between 1.5 and 2, 8 between 1 and 1.5, and 10 below 1.
Where does this come from, or is it intended? Wouldn't a more balanced load be better?
The apache load figures are unreliable at the moment because there are a number of hung processes on each machine waiting for an NFS share that is never going to start working again.
Comparing with the servers named in bug 3869 - yes, those are exactly the ones that constantly show very high loads.
The crucial thing to avoid in load balancing is a wide divergence in request service time, wide enough to be user-visible. At the moment, we don't seem to have that, except in a few special cases. A couple of hundred milliseconds of divergence is acceptable; it's when you get into the seconds that you have a problem.
This is data from the last 24 hours or so: [...] Humboldt and rose need attention, but I wouldn't worry about the rest.
Hm, yes, that's right.
As I already noticed on Friday, all machines' CPU usage (user+system) is only at about 60% in the long-term mean. That looks OK.
This was an intentional part of our architecture. Local squid clusters serve local users, this reduces latency due to network round-trip time (RTT). Of course, this only makes sense if that network RTT (200ms or so) is greater than the local service time due to high load. Hopefully we can fix our squid software problems and add more squid hardware to Florida (which is the only site where it truly seems to be lacking), and thus maintain this design. Spreading load across servers distributed worldwide is cheaper but necessarily slower than the alternative of minimising distance between cache and user.
The RTT from me (Germany) to knams is the same as to pmtpa (around 60ms), but it's 350ms to yaseo. That's of course a lot higher, but if they had the right data available, it would still be a lot faster than the dozens of seconds (if not timing out) the knams Squids sometimes need to deliver pages.
I must admit it's not always slow, but when it is, it stays slow for a longer time (mostly in the evening hours). Yet I can't see any noticeable problems on Ganglia then. I know that Wikipedia's servers have to serve a lot of requests, and that these have been increasing constantly (judging from the graphs at NOC), but when it's slow there is no higher load and no spike in requests - strange. Maybe it's some internal reason, like waiting for an NFS server, but I'm sure you'll make it run smoothly again.
It might interest you to know that as soon as I came to terms with the fact that the knams cluster would be idle at night, I promoted the idea of using that idle CPU time for something useful, like non-profit medical research. We've now clocked up 966 work units for Folding@Home:
http://fah-web.stanford.edu/cgi-bin/main.py?qtype=userpage&username=wiki...
Oh, that's nice indeed. If something hinders reasonable worldwide usage, that's surely the right thing to do with the processing power.
- I read about new machines srv51-70. Where do they come from? I can't see a recent order for them, nor are they mentioned on [[meta:Wikimedia_servers]].
No idea. I just use the things.
Ah, so it must be great to suddenly see 20 servers appear for you to use. ;-)
Jürgen
Jürgen Herz wrote:
The RTT from me (Germany) to knams is the same as to pmtpa (around 60ms), but it's 350ms to yaseo. That's of course a lot higher, but if they had the right data available, it would still be a lot faster than the dozens of seconds (if not timing out) the knams Squids sometimes need to deliver pages.
I must admit it's not always slow, but when it is, it stays slow for a longer time (mostly in the evening hours). Yet I can't see any noticeable problems on Ganglia then. I know that Wikipedia's servers have to serve a lot of requests, and that these have been increasing constantly (judging from the graphs at NOC), but when it's slow there is no higher load and no spike in requests - strange. Maybe it's some internal reason, like waiting for an NFS server, but I'm sure you'll make it run smoothly again.
The main reason for slow squid service times lately seems to have been memory issues. A couple of days ago, one of the knams squids was very slow (often tens of seconds) because it was swapping, and another was heading that way.
System administration issues like this are a very common cause of slowness. There's no magic bullet to solve it -- it's just a matter of progressively improving our techniques, and increasing the size of the sysadmin team. Lack of hardware may be an issue for certain services, but identifying which services are the problem, determining what we need to order, and then working out which part of the chain will give out next, is no easy task. We have 3 squid clusters and 2 apache clusters with their own memcached, DB, NFS and search -- if any one of those services has a problem, it will lead to a slow user experience. To add to the headache, many reports of slowness are due to problems with the client network rather than with our servers.
Luckily most of our monitoring statistics are public, so the entry barrier to this kind of performance analysis is low. I'm glad you're taking an interest. If you want to offer advice on a real-time basis, the #wikimedia-tech channel on irc.freenode.net is the best place to do it.
-- Tim Starling