Jürgen Herz wrote:
> Hello,
> I spent the last few hours trying to dig into the infrastructural
> organization of the Wikimedia servers. My starting points were
> [[meta:Wikimedia_servers]] and Ganglia, and my motivation was
> Wikipedia's recent slowness.
> Contrary to my expectations, the database servers are far from
> being under high load. The pressure even seems so low that you could
> easily live without holbach and webster for days (or, respectively, over a month).
> The bottlenecks are the Apaches and Squids (yes, I know that's nothing new to
> you).
> But as on all the other clusters, the load is very unevenly
> distributed across the machines. For example, the Yahoo! squids showed
> yf1003 9.39, yf1000 7.60, yf1004 1.60, yf1002 1.44, yf1001 0.73
> at noon (UTC) today, and similar load values (albeit with a different
> distribution) at other times.
That's just ordinary random variation. The 15 minute load average is much
more closely clustered than the 1 minute load average.
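The kernel computes load averages as exponentially damped moving averages of the run-queue length, sampled every few seconds; the longer the damping period, the more the short bursts are smoothed away. A toy simulation with synthetic samples, not real monitoring data, shows the effect:

```python
import math
import random
import statistics

def damped_series(samples, period_s, interval_s=5.0):
    """Exponentially damped moving average, roughly as the kernel computes
    loadavg: each sample is blended in with weight (1 - exp(-interval/period))."""
    decay = math.exp(-interval_s / period_s)
    avg, series = 0.0, []
    for s in samples:
        avg = avg * decay + s * (1.0 - decay)
        series.append(avg)
    return series

# Synthetic instantaneous run-queue lengths: bursty, mean around 4.
random.seed(1)
samples = [random.expovariate(1 / 4.0) for _ in range(5000)]

# Drop the warm-up, then compare how widely each average wanders.
spread_1m = statistics.pstdev(damped_series(samples, 60.0)[1000:])
spread_15m = statistics.pstdev(damped_series(samples, 900.0)[1000:])
print(spread_1m, spread_15m)  # the 15-minute average varies far less
```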
> Or the Apaches in Florida:
> 16 Apaches with load around 15, 9 between 1.5 and 2, 8 between 1 and
> 1.5, and 10 less than 1.
> Where does this come from, or is it intentional? Wouldn't a more balanced
> load be better?
The apache load figures are unreliable at the moment because there are a
number of hung processes on each machine waiting for an NFS share that is
never going to start working again. But there are still a few points I can
make about this:
* The apache load also sees quite a lot of random variation from minute to
minute. This is unavoidable, and in my opinion, harmless. We can't tell how
much CPU time a request will require before we let apache accept the
connection. That's why we have multiprocessing.
* Perlbal suffered from a couple of load balancing problems, such as
oscillation of load between the perlbal servers. When we switched to LVS,
with weighted least connection scheduling, load distribution became much
more stable on the 10 minute time scale.
* Currently we're not using a higher weight for the dual-CPU apaches than
for the single-CPU apaches. It's likely that the optimal concurrency level
for dual-CPU machines is higher than for single-CPU machines. I'm not sure
what impact this has on throughput, but I suspect it would be fairly small,
especially in times of high load. To have high throughput, you just have to
have *enough* connections queued or active in order to keep the CPU busy,
and at high load that condition is likely to be satisfied.
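For the record, weighted least-connection scheduling is simple enough to sketch. The server names and weights below are invented (giving a dual-CPU apache weight 2), and real ipvs uses a slightly more elaborate overhead formula, but the weighting effect is the same:

```python
servers = {
    # name: [weight, active connections]; names and weights are made up
    "srv11": [1, 0],
    "srv12": [1, 0],
    "srv13": [2, 0],  # pretend this one is dual-CPU
}

def pick_server():
    # Weighted least connection: choose the server with the lowest
    # active-connections-per-weight ratio.
    return min(servers, key=lambda s: servers[s][1] / servers[s][0])

def dispatch(n):
    for _ in range(n):
        servers[pick_server()][1] += 1

dispatch(40)
print(servers)  # srv13 holds about twice the connections of the others
```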
The crucial thing to avoid in load balancing is a wide divergence in request
service time, wide enough to be user-visible. At the moment, we don't seem
to have that, except in a few special cases. A couple of hundred
milliseconds divergence is acceptable, it's when you get into the seconds
that you have a problem.
This is data from the last 24 hours or so:
mysql> select pf_server,pf_time/pf_count from profiling where
pf_name='-total';
+-----------+------------------+
| pf_server | pf_time/pf_count |
+-----------+------------------+
| srv15 | 204.56651921613 |
| goeje | 201.84655049787 |
| srv16 | 153.09295690508 |
| srv49 | 103.67533112583 |
| srv12 | 142.05136531207 |
| srv45 | 171.57344543147 |
| srv29 | 137.25579172808 |
| srv39 | 169.4883604123 |
| srv20 | 149.64453125 |
| srv18 | 159.00027606007 |
| srv41 | 143.65079066265 |
| srv22 | 159.61156337535 |
| srv25 | 141.55984462781 |
| srv21 | 142.70388359744 |
| diderot | 324.14045643154 |
| rabanus | 208.7263215859 |
| humboldt | 834.82322485207 |
| srv43 | 129.96146801287 |
| avicenna | 173.68361557263 |
| srv27 | 428.79877888655 |
| srv19 | 229.72577751196 |
| srv28 | 148.12455417799 |
| srv35 | 125.47606219212 |
| srv48 | 142.57025343643 |
| srv37 | 147.24074616948 |
| srv32 | 145.71093201754 |
| srv46 | 161.01125266335 |
| srv24 | 175.73745958135 |
| srv44 | 173.76259793447 |
| srv14 | 143.50663395904 |
| srv11 | 146.01194914544 |
| srv30 | 162.15520549113 |
| srv23 | 169.26885457677 |
| srv0 | 269.70157858274 |
| srv38 | 182.01380813953 |
| srv17 | 143.32407831225 |
| srv34 | 120.18475895904 |
| srv13 | 165.9256635274 |
| alrazi | 203.50503060089 |
| srv50 | 152.40960211391 |
| srv36 | 210.69465332031 |
| hypatia | 169.72124821556 |
| srv47 | 161.01089421174 |
| friedrich | 180.30277434721 |
| srv40 | 168.59215472028 |
| srv3 | 145.29130517183 |
| srv33 | 119.75965134641 |
| srv4 | 149.60560154037 |
| kluge | 173.34889083873 |
| rose | 675.57995605469 |
+-----------+------------------+
50 rows in set (0.01 sec)
Humboldt and rose need attention, but I wouldn't worry about the rest.
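Screening the table mechanically picks out the same two machines. The threshold of three times the cluster median is an arbitrary cut-off of mine for illustration, and only a subset of the rows is reproduced here:

```python
import statistics

# Mean service time per server (ms), a few rows from the table above.
service_ms = {
    "srv15": 204.6, "goeje": 201.8, "srv16": 153.1, "srv49": 103.7,
    "diderot": 324.1, "humboldt": 834.8, "srv27": 428.8, "rose": 675.6,
    "srv33": 119.8, "srv34": 120.2, "kluge": 173.3, "alrazi": 203.5,
}

def outliers(times, factor=3.0):
    """Flag servers whose mean service time exceeds `factor` times the
    cluster median; the factor is an arbitrary choice for illustration."""
    median = statistics.median(times.values())
    return sorted(s for s, t in times.items() if t > factor * median)

print(outliers(service_ms))  # ['humboldt', 'rose']
```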
> Other point: the Yahoo! Squids do virtually nothing between 18:00 and
> 0:00 (and the machines besides yf1000-yf1004 do virtually nothing around
> the clock). How nice it would be to have them help out the other
> overloaded machines in Florida and the Netherlands, at least during those six
> hours.
This was an intentional part of our architecture. Local squid clusters serve
local users; this reduces latency due to network round-trip time (RTT). Of
course, this only makes sense if that network RTT (200ms or so) is greater
than the local service time due to high load. Hopefully we can fix our squid
software problems and add more squid hardware to Florida (which is the only
site where it truly seems to be lacking), and thus maintain this design.
Spreading load across servers distributed worldwide is cheaper but
necessarily slower than the alternative of minimising distance between cache
and user.
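In other words, the choice reduces to comparing round-trip time plus queueing delay at each site. The figures below are invented for illustration, not measurements:

```python
def best_site(rtt_remote_ms, queue_local_ms, queue_remote_ms, rtt_local_ms=20):
    """Pick the cache site with the lower total latency (all numbers
    here are illustrative, not measured)."""
    local = rtt_local_ms + queue_local_ms
    remote = rtt_remote_ms + queue_remote_ms
    return "local" if local <= remote else "remote"

# A lightly loaded local cluster wins despite some queueing...
print(best_site(rtt_remote_ms=200, queue_local_ms=50, queue_remote_ms=5))   # local
# ...but an overloaded local cluster loses to an idle remote one.
print(best_site(rtt_remote_ms=200, queue_local_ms=400, queue_remote_ms=5))  # remote
```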
It might interest you to know that as soon as I came to terms with the fact
that the knams cluster would be idle at night, I promoted the idea of using
that idle CPU time for something useful, like non-profit medical research.
We've now clocked up 966 work units for Folding@Home:
http://fah-web.stanford.edu/cgi-bin/main.py?qtype=userpage&username=wik…
> And no, I'm not criticizing anyone, nor do I claim to know how to do
> it better. But the available information looks strange to me; it would be
> great to get some explanations.
> Speaking of explanations, I have three more simple questions:
> 1. The Squids at lopar have been idle the whole time since DNS was moved
> off them. What were the problems with them, and will they be back soon?
> 2. Commons has been very slow since the move from the previous "overloaded"
> server to the new one. Any explanation to satisfy a simple user?
> And which server is the new one?
Brion answered those two well enough.
> 3. I read about new machines srv51-70. Where do they come from? I can't
> see a recent order for them, nor are they mentioned on
> [[meta:Wikimedia_servers]].
No idea. I just use the things.
-- Tim Starling