Hello,
But, as with all the other clusters, the load is distributed very unevenly over the machines. For example, the Yahoo! squids showed yf1003 at 9.39, yf1000 at 7.60, yf1004 at 1.60, yf1002 at 1.44 and yf1001 at 0.73 at noon (UTC) today, and similar load values (albeit with a different distribution) at other times.
That's just ordinary random variation. The 15 minute load average is much more closely clustered than the 1 minute load average.
Oh, indeed. That's hard to see in the cluster overview without switching to load_fifteen.
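(As a side note, here is a rough Python sketch of why that is, assuming the usual exponentially damped load-average calculation sampled every five seconds; the run-queue numbers are made up, not taken from Ganglia:)

  import math, random, statistics

  DECAY_1  = math.exp(-5 / 60)    # 1-minute average, 5 s sampling
  DECAY_15 = math.exp(-5 / 900)   # 15-minute average, 5 s sampling

  def damped_averages(run_queue_samples):
      """Run the same samples through both decay constants."""
      avg1 = avg15 = 0.0
      for n in run_queue_samples:
          avg1  = avg1  * DECAY_1  + n * (1 - DECAY_1)
          avg15 = avg15 * DECAY_15 + n * (1 - DECAY_15)
      return avg1, avg15

  # Ten identical "machines": same mean load, independent random bursts.
  random.seed(0)
  one_min, fifteen_min = zip(*(
      damped_averages([max(0.0, random.gauss(2, 3)) for _ in range(2000)])
      for _ in range(10)))
  print("spread of 1-minute averages: ", statistics.pstdev(one_min))
  print("spread of 15-minute averages:", statistics.pstdev(fifteen_min))

The 15-minute averages come out much more tightly clustered than the 1-minute ones even though every machine sees the same mean load, which is exactly the effect visible in the cluster overview.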
Or the Apaches in Florida: 16 Apaches with a load around 15, 9 with a load between 1.5 and 2, 8 between 1 and 1.5, and 10 below 1.
Where does this come from - or is it intentional? Wouldn't a more balanced load be better?
The Apache load figures are unreliable at the moment because there are a number of hung processes on each machine waiting for an NFS share that is never going to start working again.
Comparing with the servers named in bug 3869 - yes, those are the ones that constantly show very high loads.
The crucial thing to avoid in load balancing is a wide divergence in request service time, wide enough to be user-visible. At the moment, we don't seem to have that, except in a few special cases. A couple of hundred milliseconds divergence is acceptable, it's when you get into the seconds that you have a problem.
This is data from the last 24 hours or so: [...] Humboldt and rose need attention, but I wouldn't worry about the rest.
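(For illustration, a minimal sketch of that kind of check in Python - the input format and the 0.5 s budget are assumptions for the example, not what is actually run:)

  import statistics
  from collections import defaultdict

  def flag_slow_servers(samples, budget_s=0.5):
      """Flag servers whose median service time sits more than
      budget_s seconds above the cluster-wide median."""
      per_server = defaultdict(list)
      for server, seconds in samples:
          per_server[server].append(seconds)
      medians = {s: statistics.median(t) for s, t in per_server.items()}
      cluster = statistics.median(medians.values())
      return {s: m for s, m in medians.items() if m - cluster > budget_s}

  # Made-up (server, service time in seconds) pairs.
  samples = [("humboldt", 2.4), ("humboldt", 3.1), ("rose", 1.9),
             ("rose", 2.2), ("srv31", 0.21), ("srv31", 0.30),
             ("srv32", 0.35), ("srv32", 0.28)]
  print(flag_slow_servers(samples))   # -> {'humboldt': 2.75, 'rose': 2.05}

A few hundred milliseconds of divergence stays under the budget; servers stuck in the seconds range get flagged.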
Hm, yes, that's right.
As I already noticed on Friday, the CPU usage (user + system) of all machines is only at about 60% in the long-term mean. That looks OK.
This was an intentional part of our architecture. Local squid clusters serve local users, this reduces latency due to network round-trip time (RTT). Of course, this only makes sense if that network RTT (200ms or so) is greater than the local service time due to high load. Hopefully we can fix our squid software problems and add more squid hardware to Florida (which is the only site where it truly seems to be lacking), and thus maintain this design. Spreading load across servers distributed worldwide is cheaper but necessarily slower than the alternative of minimising distance between cache and user.
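(A toy sketch of that trade-off, assuming one could simply pick the cluster that minimises RTT plus current service time - the figures are invented and this is not how the real geographic balancing is configured:)

  def best_cluster(rtt_ms, service_ms):
      """Pick the cluster with the lowest estimated total response
      time: network round trip plus current local service time."""
      return min(rtt_ms, key=lambda c: rtt_ms[c] + service_ms[c])

  # Invented numbers for a European user while knams is overloaded.
  rtt = {"knams": 60, "pmtpa": 260, "yaseo": 350}
  service = {"knams": 5000, "pmtpa": 300, "yaseo": 300}
  print(best_cluster(rtt, service))   # -> pmtpa: 200 ms of extra RTT
                                      #    beats seconds of local queueing

As long as the local cluster's service time stays well below the RTT saved, sending users to the nearest squids wins.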
The RTT from me (Germany) to knams is the same as to pmtpa (around 60 ms), but it's 350 ms to yaseo. That's of course a lot higher, but if they had the right data available, this would still be a lot faster than the dozens of seconds (if not timing out) that the knams Squids need to deliver pages.
I must admit it's not always slow, but when it is, it stays slow for a longer time (mostly in the evening hours). Yet I can't see any noticeable problems on Ganglia. I know that Wikipedia's servers have to serve a lot of requests, and that these have increased constantly (judging from the graphs at NOC), but when it's slow there is no higher load and no increase in requests - strange. Maybe it's some internal reason like waiting for an NFS server, but I'm sure you'll make it run smoothly again.
It might interest you to know that as soon as I came to terms with the fact that the knams cluster would be idle at night, I promoted the idea of using that idle CPU time for something useful, like non-profit medical research. We've now clocked up 966 work units for Folding@Home:
http://fah-web.stanford.edu/cgi-bin/main.py?qtype=userpage&username=wiki...
Oh, that's nice indeed. If something hinders reasonable worldwide usage, that's surely the right thing to do with the processing power.
- I read about the new machines srv51-70. Where do they come from? I can't see a recent order for them, nor are they mentioned on [[meta:Wikimedia_servers]].
No idea. I just use the things.
Ah, so it must be great suddenly seeing 20 servers appear for you to use. ;-)
Jürgen