Gabriel, Jimbo, list,
many arguments have been made on the list. I've got the impression that
there's consensus about the following:
* With squid + round-robin DNS + heartbeat, we won't need Linux Director
for now.
* One squid box with a >2 GHz CPU and >=2 GB RAM can handle significantly
more than the current workload of all servers, even assuming the load
doubles over the next year.
* Redundancy is a must.
* Single-host reliability can be "low" as long as the farm as a whole stays
available, so high-end features are not required.
* Many similar boxes provide higher flexibility than many different boxes.
* The backup box for the DB should have the same CPU architecture as geoffrin.
(Well, perhaps no consensus here, but no one has contradicted it yet.)
* Current setup is a dual Athlon 1800 running apache and mysql and a
second, somewhat slower machine. Overall performance is slow.
Is this canonical?
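For illustration only, the round-robin part of the setup could be as simple as two A records in the zone file; the names and addresses below are made up:

```
; hypothetical BIND zone fragment: round-robin over the two squid frontends
www     300     IN      A       10.0.0.11       ; squid1
www     300     IN      A       10.0.0.12       ; squid2
```

Heartbeat would then handle IP takeover between the two squids, so clients still holding the address of a dead box reach a live one.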
I'd draw the conclusion that, after taking the anonymous pageviews away from
the apaches (handing them over to squid) and after reviving geoffrin,
the current web servers would be fast enough to handle the current
workload. But looking at the trouble we had with pliny's stability, I'd
still prefer to replace them. I think this is also the common understanding.
The "squid class" servers are at 1810$ at Silicon Mechanics
(2,6 GHz P4, 2*1GB Mem, 2*80GB SATA, no CD).
A Opteron box for the DB Backup server would be 2,810$ at
penguincomputing.com. (Dual Opteron 240, 4*512MB RAM, 2*80GB ATA, no CD)
6,430$ spent on these.
= Configuration =
The question is whether to use 4 of the "squid class" single-CPU servers
or two of the "DB backup class" dual Opterons. 4 of the small boxes would
cost $7,240; two of the big servers would be $5,620.
We would apparently want to have at least one machine on site as a hot
spare and for testing. This would be another "web server class"
server, resulting in 5 small or 3 big ones, at $9,050 or $8,430.
Not much of a difference, so money will not make the decision.
4 small ones sound more reliable, if the squids work properly.
So that's $15,480 in total.
Summary:
2*squid, small server
4*web server, small server,
1*hot spare/test, small server,
1*DB backup, Opteron box.
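The price arithmetic above can be double-checked with a few lines of Python (unit prices as quoted: $1,810 per small box, $2,810 per Opteron box):

```python
# Sanity check of the quoted totals; unit prices are the ones quoted above.
SMALL = 1810      # "squid class" single-CPU P4 box
OPTERON = 2810    # dual Opteron 240 "DB backup class" box

print(4 * SMALL)    # 7240 -- four small web servers
print(2 * OPTERON)  # 5620 -- two big boxes instead
print(5 * SMALL)    # 9050 -- four small plus a small hot spare
print(3 * OPTERON)  # 8430 -- two big plus a big hot spare

# Full configuration: 2 squids + 4 web servers + 1 spare (all small) + 1 DB backup
print((2 + 4 + 1) * SMALL + OPTERON)  # 15480
```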
Comments?
Re: nine 2 GHz boxes instead: those are hard to get, and a 2.6 GHz
CPU isn't that expensive any more, around $200. It's not a Xeon we're
talking about.
Best regards,
JeLuF
Jimbo wrote:
>Directly in the new colo, because moving them once they are in service
>would be very difficult. One of the things that has persuaded me to
>move is that both Bomis and Wikimedia are about to undergo major
>hardware upgrades, thus providing an easy opportunity for migration to
>a new facility.
Where is the new colo going to be? IMO it should be less than an hour away
from Brion and/or Jason (preferably somewhere in-between where they both
live).
-- Daniel Mayer (aka mav)
Tim wrote:
>Wikipedia will forever be haunted by problems such as these, since, at
>Anthere's request, I turned username blocking off by default. I'll turn
>it on at meta. Are there any other wikis you want it enabled at, while
>I'm at it? How about simple, wikibooks and wiktionary?
Oh, I thought it was the other way around. Might as well enable it on simple,
wikibooks and wiktionary.
-- mav
Can anyone give me some perspective on this celeron vs p4 issue?
For equivalent ghz, what's the difference likely to be *from
the perspective of webserving*?
I just really want to thank everyone taking part in this discussion. I
think it's really valuable and very helpful to me in figuring out what
to do. I know it takes a lot of time to mull this over and visit the
websites to look for configurations and actual pricing data, but wow,
it is very, very valuable.
--Jimbo
I've tweaked pliny's apache config a bit; keepalive is now off and the
max connections setting is turned up to its maximum of 255. This should
keep connections from piling up and forcing new ones to queue up quite
so often.
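In Apache 1.3 terms, the tweak corresponds to directives roughly like these (a sketch of the described settings, not pliny's actual file):

```
# Sketch of the tweak described above (httpd.conf, Apache 1.3 directive names)
KeepAlive Off
MaxClients 255
```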
Pliny's still having those 'bad sector' errors occasionally. Obviously
it's a little worrying. :) Someone suggested that a nice thorough fsck
with a bad-block scan could get the bad sectors marked and thus worked
around; this could help once we've got something to take over for pliny
sensibly and we can take it offline...
Ursula is still straining with the databases; this is mostly
disk-bound, on a slow disk. I don't really want to add the web serving
on top of it.
If we get geoffrin back online and working soon, ursula can take over
for pliny while it's being fixed.
And, of course, there's the New Machine; if the overheating problems
are resolved and it works in the near future, I'd like to get it to
replace larousse in serving en.wikipedia.org. This will get rid of the
need for en2 until we get an all-around server farm set up.
-- brion vibber (brion @ pobox.com)
From
http://216.239.57.104/search?q=cache:kVfI1Wb2qmIJ:www.computer.org/micro/mi…
"To provide sufficient capacity to handle query traffic, our service
consists of multiple clusters distributed worldwide. Each cluster has
around a few thousand machines, and the geographically distributed setup
protects us against catastrophic data center failures (like those arising
from earthquakes and large-scale power failures). A DNS-based
load-balancing system selects a cluster by accounting for the user’s
geographic proximity to each physical cluster. The load-balancing system
minimizes round-trip time for the user’s request, while also considering
the available capacity at the various clusters."
Sounds like they're running a sort of Super Sparrow as well...
Some strange pics:
http://backrub.nerisoft.com/May1998/hardware.htm
--
Gabriel Wicke
If all goes well, Wikipedia will be able to use 3 add'l machines
tomorrow.
1. Geoffrin is fixed, and Jason is stress testing it today. We got a
new mobo from Penguin, no charge.
2. 'lb1' is an old bomis server -- it's a single Pentium 4 / IDE
machine. Possibly not much help, but it has a gig of ram and can do
some reasonable webserving. Possibly we should set this up as a
squid, to test the squid idea now, so we'll be ready to go in 2-3
weeks when I set up the new cluster?
3. 'other other new machine' -- this is the one I loaned Wikipedia
last time Jason went to SD, the one that fell off the net 30 minutes
after Brion started trying to copy stuff to it. Jason thinks that the
problem was overheating and can be solved with a reapplication of
thermal paste. Obviously, I don't consider this to be a reliable
machine, but it'll be there for us if we need it.
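To test the squid idea on lb1 (item 2 above), a minimal Squid 2.x accelerator setup could start from something like this; all addresses and sizes here are assumptions, not a tested config:

```
# Hypothetical squid.conf fragment: run squid as an httpd accelerator
http_port 80
httpd_accel_host 10.0.0.21            # backend apache (made-up address)
httpd_accel_port 80
httpd_accel_single_host on
httpd_accel_uses_host_header on       # needed for name-based vhosts
cache_mem 512 MB                      # lb1 has 1 GB of RAM
```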
--Jimbo