Daniel Mayer wrote:
Not sure if this has been discussed yet, but we have US$95,000 budgeted for hardware this quarter ($20,000 of which is for "Extra hardware/dev projects", meaning hardware and/or paying for critical software development). This does not include whatever has been bought already (Jamesday: how much have we spent since the January order? I don't have the bank records for that yet).
http://wikimediafoundation.org/wiki/Budget/2005
I'd really like to get a general idea of what we want to buy soon. If it can be bought before the end of the quarter, two weeks from now, that would make accounting a bit easier for me :).
Note: About $20,000 of the money generated is in the Wikimedia Deutschland bank account. The easiest way I can see to use that money is to buy equipment that would be used in a German-based datacenter. I’ve been told that there are many different offers for free/subsidized hosting in Germany. http://wikimediafoundation.org/wiki/Fund_drives/2005/Q1
So, what do we want to buy?
Daniel Mayer (aka mav)
You asked for some opinions: well, here are my ramblings on the matter. I hope they are reasonably on-topic.
Don't buy cheap anything: at this scale of operations, it costs more than it saves. I wouldn't buy any more boxes from your previous suppliers: the failure rates (for whatever reasons) have been disastrous. Even buying Dell rack-mount boxes would be better: they're (fairly) reliable, (fairly) well-built, and you can buy them anywhere in the world, which is good for standardization. If not Dell, consider some other global supplier. Accept that stuff will break, and break all the time; a good supplier will come to your colo, take the old server away, and fit a new one in your rack. You'll never see the old one again: no waiting for a return to the factory, or the vendor wanting you to test it yourself. _This_ sort of service is what you pay the big suppliers for, and it's worth it when you are running a serious 24/7 system. Also, if you are considering placing a big order with a single supplier, they might be willing to give you a large discount in return for being put on the Wikimedia Foundation's list of sponsors.
Regardless of supplier, I'd try not to buy lots of custom-tailored boxes for squids or apaches: just spec one generic box that can handle either role, as well as generic glue computing (routing, load balancing, etc.). Put as much memory in them as you can afford, and big fat NICs. The money you appear to waste now you'll save later, because having as big a pool of similar machines as possible reduces inventory and spares problems. This way you can also build standard install images, or netinstall images, and deploy them in a "cookie cutter" way when you need to clone more squids or apaches. If you configure the install images/packages right, you should also be able to run them as virtual machines during testing.
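To make the "cookie cutter" idea concrete, here is a minimal sketch (my illustration only, not anything the cluster actually runs) of turning one generic box spec into per-role install inventory; the role names, package lists and addresses are all placeholders:

    from string import Template

    # Hypothetical role definitions: one generic hardware spec, the role is
    # chosen purely by which packages and config get installed on the box.
    ROLE_PACKAGES = {
        "squid":  ["squid"],
        "apache": ["apache2", "memcached"],
    }

    HOST_TEMPLATE = Template(
        "host $hostname\n"
        "  role      $role\n"
        "  packages  $packages\n"
        "  address   $address\n"
    )

    def render_host(hostname, role, address):
        return HOST_TEMPLATE.substitute(
            hostname=hostname,
            role=role,
            packages=" ".join(ROLE_PACKAGES[role]),
            address=address,
        )

    # Cloning three more apaches is just three more inventory entries:
    for i, addr in enumerate(["10.0.0.21", "10.0.0.22", "10.0.0.23"], start=1):
        print(render_host("apache%02d" % i, "apache", addr))

The point is that the hardware stays interchangeable; only the data above differs between a squid and an apache.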
Order 64-bit-capable machines, of course, for their security advantages and for the speed gains when running 64-bit software. Move to 64-bit software as soon as reasonably possible.
Consider having two of everything: two NFS servers, two mail servers, etc. (I know you are already part way towards this.) If necessary, consolidate numerous low-throughput services onto a single server, then have a second server as a complete backup for the first. N services on a single server with a backup is much more reliable than N services on N servers. If you need two servers to handle the throughput, buy three.
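A quick back-of-the-envelope calculation makes the point; assume (purely for illustration) that each box is independently up with probability p:

    # n services spread over n single boxes: everything is up only if every box is up.
    def all_up_spread(p, n):
        return p ** n

    # n services consolidated on one box with a full standby: everything is up
    # unless both boxes are down at the same time.
    def all_up_consolidated(p):
        return 1 - (1 - p) ** 2

    p, n = 0.99, 5
    print("spread over %d boxes:  %.4f" % (n, all_up_spread(p, n)))   # ~0.9510
    print("one box plus standby: %.4f" % all_up_consolidated(p))      # 0.9999

With five services and 99%-available boxes, the spread layout has everything up about 95% of the time; the consolidated pair manages 99.99%, and it uses two boxes instead of five.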
Remember that your tightest constraint is admin skills and time; having a highly standardized system for deployment will save you _vast_ amounts of scarce time, and any deployment automation you can use will make for vastly more reliable systems deployment. Consider Debian as both an operating system and a deployment system: at the moment, different machines run different operating system flavours, making sysadminning harder.
Note that the software packaging approach for deployment also makes it easier to deploy in remote sites: even on hardware other than your own.
Databases are a different issue: you can't apply the same commodity thinking. However, try to order DB machines in identical multiples too, for the same reasons. Clearly the DB machines will need to be hand-crafted. I don't know much about databases at this sort of scale... but I imagine the Wikipedia developers do.
Consider investing in some simple environmental monitoring: power drain, temperature, and humidity, in various places. You can get rack-mount units for this. Overheating, or excess dryness or wetness, is seriously bad for your servers, and hence for your data and your sysadmins' sanity. An ounce of prevention is again worth a pound of cure.
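As a sketch of how little glue this needs once the sensors are in place (the sensor path, threshold and mail addresses below are made-up placeholders; a real rack-mount unit would more likely be polled over SNMP or a serial line), something like this run from cron would do:

    import smtplib
    from email.message import EmailMessage

    TEMP_LIMIT_C = 30.0                                   # illustrative threshold
    SENSOR_PATH = "/var/run/envmon/rack1_temperature"     # placeholder path

    def read_temperature(path=SENSOR_PATH):
        # Assumes the monitoring unit exposes its reading as plain text.
        with open(path) as f:
            return float(f.read().strip())

    def alert(subject, body, to="root@localhost"):
        msg = EmailMessage()
        msg["Subject"] = subject
        msg["From"] = "envmon@localhost"
        msg["To"] = to
        msg.set_content(body)
        with smtplib.SMTP("localhost") as smtp:
            smtp.send_message(msg)

    if __name__ == "__main__":
        temp = read_temperature()
        if temp > TEMP_LIMIT_C:
            alert("Rack over-temperature",
                  "rack1 is at %.1f C (limit %.1f C)" % (temp, TEMP_LIMIT_C))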
Finally, you might consider buying a cheap radio clock on a per-site basis, if your colo does not already provide a local stratum-1 NTP feed. Keeping your distributed clusters around the world in time sync, even when connectivity is lost, is a good thing when you have a widely distributed system that wants to keep global consistency.
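Checking how far a box has drifted from the reference clock takes nothing beyond a minimal SNTP query (RFC 4330) using the standard library; the server name below is just a stand-in for whatever stratum-1 source each site ends up with:

    import socket
    import struct
    import time

    NTP_EPOCH_OFFSET = 2208988800  # seconds between 1900-01-01 and 1970-01-01

    def sntp_time(server="ntp.example.org", port=123, timeout=5.0):
        # Build a 48-byte SNTP request: LI=0, version 3, mode 3 (client).
        request = b"\x1b" + 47 * b"\0"
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
            sock.settimeout(timeout)
            sock.sendto(request, (server, port))
            reply, _ = sock.recvfrom(48)
        # Transmit timestamp (seconds since 1900) sits at bytes 40-43 of the reply.
        seconds = struct.unpack("!I", reply[40:44])[0]
        return seconds - NTP_EPOCH_OFFSET

    if __name__ == "__main__":
        drift = time.time() - sntp_time()
        print("local clock is %.3f s ahead of the NTP server" % drift)

Anything more serious than a one-off check should, of course, go through ntpd itself.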
-- Neil