[Foundation-l] Operations report, Nov 2005-Aug 2006

Sat Aug 19 07:54:10 UTC 2006

This hasn't been done for a while, so I'll try to sum up the changes
in our operations since November 2005.

There has been much less insane headless-chicken running around, and
we've seen quite steady operation (except for a few hiccups) lately.
First of all, for a while now we could afford to order hardware before
we were completely overloaded - overload was a constant tune in
previous years.
There have been lots of system architecture changes lately too - the
way we store data, and the way we serve and cache images and text.

==Hardware==

One piece of good news is that we can still stay with the same class
of database servers, which are even getting much cheaper than before.
Database server cost per unit went from $15000 in June 2005 to $12500
in October 2005, to $9070 in March 2006.
We got four of these servers in March and called them... db1, db2,  
db3 and db4.
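
To put the price trend above into percentages, here is a small Python
sketch. The three unit prices come from this report; the batch total
is only an assumption that all four March servers cost the quoted
March unit price.

```python
# Per-unit database server prices quoted in this report (USD).
prices = [
    ("Jun 2005", 15000),
    ("Oct 2005", 12500),
    ("Mar 2006", 9070),
]


def percent_drop(old, new):
    """Return the price decrease from old to new as a percentage of old."""
    return (old - new) / old * 100


def step_drops(series):
    """Return (from_label, to_label, percent_drop) for each consecutive pair."""
    return [
        (a_label, b_label, percent_drop(a_price, b_price))
        for (a_label, a_price), (b_label, b_price) in zip(series, series[1:])
    ]


# Drop from the first quoted price to the last one.
overall = percent_drop(prices[0][1], prices[-1][1])

# Assumption: all four March servers were bought at the March unit price.
march_batch_total = 4 * prices[-1][1]

if __name__ == "__main__":
    for start, end, drop in step_drops(prices):
        print(f"{start} -> {end}: {drop:.1f}% cheaper")
    print(f"Overall: {overall:.1f}% cheaper")
    print(f"Four servers at the March price: ${march_batch_total}")
```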

For the application environment we made a single $100000 purchase
that provided us with 40 high-performance servers (with two dual-core
Opteron processors and 4GB of RAM each).
This nearly doubled our CPU capacity, and also provided enough space
for revision storage, in-memory caching, etc.

For our current caching layer expansion we ordered 20
high-performance servers (8GB of memory, four fast disks, $3300
each), which should appear in production in about a month.
We're investigating the possibility of adding more hardware to the
Amsterdam cluster. We might end up with 10 additional cache servers
there too.
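
For a rough sense of scale of the application and cache server
purchases described above, the figures can be tallied as below. This
is only back-of-the-envelope arithmetic on numbers from this report;
the `Purchase` class and helper names are made up for illustration,
and the cache total is derived as 20 * $3300 rather than quoted.

```python
from dataclasses import dataclass


@dataclass
class Purchase:
    """One hardware order (hypothetical helper type, not a real tool)."""
    name: str
    servers: int
    total_usd: int
    ram_gb_each: int


# Application servers: single $100000 purchase of 40 boxes, 4GB RAM each.
app = Purchase("application", servers=40, total_usd=100000, ram_gb_each=4)

# Cache servers: 20 boxes at $3300 each, 8GB RAM each (total is derived).
cache = Purchase("cache", servers=20, total_usd=20 * 3300, ram_gb_each=8)


def cost_per_server(purchase):
    """Average cost of one server in the order, in USD."""
    return purchase.total_usd / purchase.servers


def total_ram_gb(purchase):
    """Combined RAM across all servers in the order, in GB."""
    return purchase.servers * purchase.ram_gb_each


# Two dual-core processors per application server.
app_cores = app.servers * 2 * 2

if __name__ == "__main__":
    for p in (app, cache):
        print(f"{p.name}: ${cost_per_server(p):.0f} per server, "
              f"{total_ram_gb(p)}GB RAM in total")
    print(f"application cores: {app_cores}")
```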

We also purchased $40000 worth of Foundry hardware, based on their
BigIron RX-8 platform.
We will use that as our highly available core routing layer, as well
as for connectivity for the most demanding servers.
It will also allow flexible networking with upstream providers.

Our next purchase will be image hosting/archival systems, and we are
still investigating whether to use our previous approach (a big cheap
server with lots of big cheap disks) or to deploy some storage
appliance.

We have reallocated some aging servers to the search cluster and other
auxiliary roles, and we continue this practice, so that we end up
with a more homogeneous application environment.

==Software==

There were lots of improvements in MediaWiki itself, but additionally
Tim and Mark ended up in the Squid authors list - changes made to its
code were critical to proper Squid performance.
We split the database cluster, with English Wikipedia ending up on a
separate set of boxes.
Some of the old database servers got a new life as slaves for just a
few languages, which compensates for their lack of memory or fast disk
systems.
Additionally, revision storage was moved from our core database boxes
to 'external storage clusters', which are our application servers
utilizing their idle disks.
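
The two ideas above (a separate cluster for English Wikipedia, and
revision text kept outside the core database) can be illustrated with
a toy Python sketch. It is not MediaWiki code: the cluster names, the
pointer format and all class and function names are invented for
illustration, and everything lives in memory.

```python
# Hypothetical layout: English Wikipedia on its own set of boxes,
# every other wiki on a shared default cluster.
WIKI_TO_CLUSTER = {"enwiki": "cluster_en"}
DEFAULT_CLUSTER = "cluster_default"


def cluster_for(wiki):
    """Pick the database cluster that serves a given wiki."""
    return WIKI_TO_CLUSTER.get(wiki, DEFAULT_CLUSTER)


class ExternalStore:
    """Toy stand-in for an external storage cluster holding revision text."""

    def __init__(self, name):
        self.name = name
        self._blobs = {}

    def store(self, text):
        """Save text and return a pointer of the form '<store name>/<id>'."""
        blob_id = len(self._blobs) + 1
        self._blobs[blob_id] = text
        return f"{self.name}/{blob_id}"

    def fetch(self, pointer):
        """Return the text a pointer refers to."""
        name, _, blob_id = pointer.partition("/")
        if name != self.name:
            raise KeyError(f"pointer {pointer!r} is not on {self.name}")
        return self._blobs[int(blob_id)]


class CoreDatabase:
    """Toy core database: keeps revision metadata and a text pointer only."""

    def __init__(self, cluster, store):
        self.cluster = cluster
        self.store = store
        self.revisions = {}

    def save_revision(self, rev_id, title, text):
        """Push the text to external storage and record only the pointer."""
        pointer = self.store.store(text)
        self.revisions[rev_id] = {"title": title, "text_pointer": pointer}
        return pointer

    def load_text(self, rev_id):
        """Follow the stored pointer to get the revision text back."""
        return self.store.fetch(self.revisions[rev_id]["text_pointer"])


if __name__ == "__main__":
    db = CoreDatabase(cluster_for("enwiki"), ExternalStore("ext1"))
    db.save_revision(1, "Main Page", "Hello, world")
    print(db.cluster, db.revisions[1], db.load_text(1))
```

The point of the split is that the core database rows stay small, so
the bulky text does not compete with metadata for memory and fast
disks.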

Optimization work targets multiple factors.
"Make it faster" means not only serving more requests per second, but
also reducing response times, and both issues are worked on
constantly.
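
To make the distinction between the two goals concrete, here is a
minimal Python sketch that computes both throughput and response-time
percentiles from one set of measurements. The sample numbers are
invented for illustration and are not measurements from our cluster;
the percentile uses the simple nearest-rank method.

```python
def throughput(request_count, window_seconds):
    """Requests served per second over an observation window."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    return request_count / window_seconds


def percentile(values, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    if not values:
        raise ValueError("values must not be empty")
    if not 1 <= pct <= 100:
        raise ValueError("pct must be between 1 and 100")
    ordered = sorted(values)
    # Integer ceiling of pct * n / 100, avoiding float rounding issues.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[rank - 1]


# Hypothetical response times (seconds) seen in a 2-second window.
response_times = [0.05, 0.06, 0.07, 0.08, 0.09, 0.10, 0.12, 0.15, 0.40, 1.20]
window_seconds = 2

if __name__ == "__main__":
    rps = throughput(len(response_times), window_seconds)
    print(f"throughput: {rps:.1f} requests/s")
    for pct in (50, 90, 99):
        print(f"p{pct} response time: {percentile(response_times, pct):.2f}s")
```

Adding servers tends to raise the first number, while the slow tail
of the second usually needs work on the slow requests themselves.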

And of course, as always, the team has been marvelous ;-) Thanks!

-- 
Domas Mituzas -- http://dammit.lt/ -- [[user:midom]]