[Foundation-l] Cluster report, September-November, 2005

Sun Nov 27 12:58:21 UTC 2005

Hello, just a shameless copy-paste from meta
(http://meta.wikimedia.org/wiki/Cluster_report%2C_September-November%2C_2005)

These months were yet again amazing in the history of Wikimedia's growth.
Since September request rates have doubled, lots of information has been
added, modified and expanded, and more users have come.
To deal with that, the site had to improve both its software and hardware
platforms again.

Of course, more hardware was thrown at the problem.
In mid-September three new database servers (thistle, ixia, lomaria)
were added to the pool, taking an ancient type of hardware out of
service.
With current data growth rates, the 'old' 4GB-RAM boxes could no longer
keep up with anything but quite limited operation.
40 dual-Opteron application servers have been deployed, conserving
our limited colocation space as well as providing lots of
performance for the buck.
One batch of them (20) was deployed just this week.
They're equipped with larger drives and more memory, which allows us to
place various unplanned services on them (9 Apache servers are
storing old revisions as well), and some servers participate in the
shared memory pool, running memcached.

One really efficient purchase was the $12k image server 'amane',
which provides us with storage space and even the ability to make
backups at current loads.
It is now running a highly efficient and lightweight HTTP server,
lighttpd.
So far images are served, but growth of Wikimedia Commons will force  
us to find a really scalable and reliable way to handle lots of media.

Additionally, 10 more application servers have been ordered, together
with a new batch of Squid cache servers.
These 10 single-opteron boxes will have 4 small and fast disks and  
should enable efficient caching of content.

As all this gear was bought with donated money, we really appreciate
the community's help here. Thank you!

The Yahoo-supplied cluster in Seoul, Korea has finally gone into action,
bringing cached content closer to Asian locations, as well as hosting
the master databases and application cluster for the Japanese, Thai,
Korean and Malaysian Wikipedias.

For internal load balancing, Perlbal was replaced by LVS, and we've
got a nice flashy donated load balancing device that may be deployed
into operation soon as well.
LVS has to be handled with care, and several tiny misconfiguration
incidents seriously affected site performance.
Lately the cluster has become quite big and complex, and now we need
more sophisticated and extensive sanity checks and test cases.

There is a lot of work under way to establish more failover
capabilities - we will have two active links to our main ISP in Florida.
The static HTML dump is (becoming) nice and usable and may help us in
case of serious crashes. It can be served from the Amsterdam cluster as
well!

Over the last several days we managed to bring the cluster into quite
proper working shape; now it's important to fix everything and
prepare for more load, more growth and yet another expansion.
We hope that, with the help of the community, we will be able to solve
all our performance and stability issues and avoid being Lohipedia :)

Lots of various problems have been solved so far to achieve what
we have now, and lots of low-hanging fruit has been picked.
What is being dealt with now is complex and needs manpower and fresh
ideas as well.

Discussions are always welcome in #wikimedia-tech on Freenode (except
during serious downtimes :).

And, of course, Thanks Team (or rather, Family)! It is amazing to  
work together!

Cheers,
Domas