Hello,
But, as with all the other clusters, the load is distributed very unevenly over the machines. For example, the Yahoo! squids showed yf1003 at 9.39, yf1000 at 7.60, yf1004 at 1.60, yf1002 at 1.44 and yf1001 at 0.73 at noon (UTC) today, and similar load values (albeit with a different distribution) at other times.
That's just ordinary random variation. The 15 minute load average is much more closely clustered than the 1 minute load average.
Oh, indeed. That's hard to see in the cluster overview without switching to load_fifteen.
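(As a side note, here is a rough Python sketch of why that is, assuming the usual exponentially damped load-average calculation sampled every five seconds; the run-queue numbers are made up, not taken from Ganglia:)

  import math, random, statistics

  DECAY_1  = math.exp(-5 / 60)    # 1-minute average, 5 s sampling
  DECAY_15 = math.exp(-5 / 900)   # 15-minute average, 5 s sampling

  def damped_averages(run_queue_samples):
      """Run the same samples through both decay constants."""
      avg1 = avg15 = 0.0
      for n in run_queue_samples:
          avg1  = avg1  * DECAY_1  + n * (1 - DECAY_1)
          avg15 = avg15 * DECAY_15 + n * (1 - DECAY_15)
      return avg1, avg15

  # Ten identical "machines": same mean load, independent random bursts.
  random.seed(0)
  one_min, fifteen_min = zip(*(
      damped_averages([max(0.0, random.gauss(2, 3)) for _ in range(2000)])
      for _ in range(10)))
  print("spread of 1-minute averages: ", statistics.pstdev(one_min))
  print("spread of 15-minute averages:", statistics.pstdev(fifteen_min))

The 15-minute averages come out much more tightly clustered than the 1-minute ones even though every machine sees the same mean load, which is exactly the effect visible in the cluster overview.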
Or the Apaches in Florida: 16 Apaches with a load around 15, 9 with a load between 1.5 and 2, 8 between 1 and 1.5, and 10 below 1.
Where does this come from - or is it intentional? Wouldn't a more balanced load be better?
The Apache load figures are unreliable at the moment because there are a number of hung processes on each machine waiting for an NFS share that is never going to start working again.
Comparing with the servers named in bug 3869 - yes, those are the ones that constantly show very high loads.
The crucial thing to avoid in load balancing is a wide divergence in request service time, wide enough to be user-visible. At the moment, we don't seem to have that, except in a few special cases. A couple of hundred milliseconds divergence is acceptable, it's when you get into the seconds that you have a problem.
This is data from the last 24 hours or so: [...] Humboldt and rose need attention, but I wouldn't worry about the rest.
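(For illustration, a minimal sketch of that kind of check in Python - the input format and the 0.5 s budget are assumptions for the example, not what is actually run:)

  import statistics
  from collections import defaultdict

  def flag_slow_servers(samples, budget_s=0.5):
      """Flag servers whose median service time sits more than
      budget_s seconds above the cluster-wide median."""
      per_server = defaultdict(list)
      for server, seconds in samples:
          per_server[server].append(seconds)
      medians = {s: statistics.median(t) for s, t in per_server.items()}
      cluster = statistics.median(medians.values())
      return {s: m for s, m in medians.items() if m - cluster > budget_s}

  # Made-up (server, service time in seconds) pairs.
  samples = [("humboldt", 2.4), ("humboldt", 3.1), ("rose", 1.9),
             ("rose", 2.2), ("srv31", 0.21), ("srv31", 0.30),
             ("srv32", 0.35), ("srv32", 0.28)]
  print(flag_slow_servers(samples))   # -> {'humboldt': 2.75, 'rose': 2.05}

A few hundred milliseconds of divergence stays under the budget; servers stuck in the seconds range get flagged.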
Hm, yes, that's right.
As I already noticed on Friday, the CPU usage (user + system) of all machines is only at about 60% in the long-term mean. That looks OK.
This was an intentional part of our architecture. Local squid clusters serve local users, this reduces latency due to network round-trip time (RTT). Of course, this only makes sense if that network RTT (200ms or so) is greater than the local service time due to high load. Hopefully we can fix our squid software problems and add more squid hardware to Florida (which is the only site where it truly seems to be lacking), and thus maintain this design. Spreading load across servers distributed worldwide is cheaper but necessarily slower than the alternative of minimising distance between cache and user.
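(A toy sketch of that trade-off, assuming one could simply pick the cluster that minimises RTT plus current service time - the figures are invented and this is not how the real geographic balancing is configured:)

  def best_cluster(rtt_ms, service_ms):
      """Pick the cluster with the lowest estimated total response
      time: network round trip plus current local service time."""
      return min(rtt_ms, key=lambda c: rtt_ms[c] + service_ms[c])

  # Invented numbers for a European user while knams is overloaded.
  rtt = {"knams": 60, "pmtpa": 260, "yaseo": 350}
  service = {"knams": 5000, "pmtpa": 300, "yaseo": 300}
  print(best_cluster(rtt, service))   # -> pmtpa: 200 ms of extra RTT
                                      #    beats seconds of local queueing

As long as the local cluster's service time stays well below the RTT saved, sending users to the nearest squids wins.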
The RTT from me (Germany) to knams is the same as to pmtpa (around 60 ms), but it's 350 ms to yaseo. That's of course a lot higher, but if they had the right data available, this would still be a lot faster than the dozens of seconds (if not timing out) that the knams Squids need to deliver pages.
I must admit it's not always slow, but when it is, it stays slow for a longer time (mostly in the evening hours). Yet I can't see any noticeable problems on Ganglia. I know that Wikipedia's servers have to serve a lot of requests, and that these have increased constantly (judging from the graphs at NOC), but when it's slow there is no higher load and no increase in requests - strange. Maybe it's some internal reason like waiting for an NFS server, but I'm sure you'll make it run smoothly again.
It might interest you to know that as soon as I came to terms with the fact that the knams cluster would be idle at night, I promoted the idea of using that idle CPU time for something useful, like non-profit medical research. We've now clocked up 966 work units for Folding@Home:
http://fah-web.stanford.edu/cgi-bin/main.py?qtype=userpage&username=wiki...
Oh, that's nice indeed. If something hinders reasonable worldwide usage, that's surely the right thing to do with the processing power.
- I read about the new machines srv51-70. Where do they come from? I can't see a recent order for them, nor are they mentioned on [[meta:Wikimedia_servers]].
No idea. I just use the things.
Ah, so it must be great suddenly seeing 20 servers appear for you to use. ;-)
Jürgen