Hello colleagues and shareholders (community :)!
It has been a while since my last review of operations (aka hosting report), so I will try to give an overview of some of the things we've been doing =) First of all, I'd like to thank Mr. Moore for his fabulous law. It allowed Wikipedia to stay alive - even though we had to grow again in all directions.
We still have Septembers. Well, it is a nice name for the recurring pattern which provides Shock and Awe to us: after a period of stable usage, every autumn the number of users suddenly goes up and stays there - just letting us think we had finally reached some saturation point and would never grow more. Until next September.

We still have World Events. People rush to us to read about conflicts and tragedies, joys and celebrations. Sometimes because we have had the information for ages, sometimes because it all matured in seconds or minutes. Nowhere else can a document require this much concurrent collaboration, and nowhere else can it provide as much value immediately.

We still have history. From day one of the project, we can see people going into dramas, discussing, evolving and revolving every idea on the site. Every edit stays there - accumulating not only the final pieces of information, but the whole process of assembling the content.

We still advance. The tools that facilitate the community get more complex, and we are growing an ecosystem of tools and processes inside and outside the core software and platform. Users are the actual developers of the project; the core technology just lags behind, assisting.
Our operation becomes more and more demanding - and that's quite a bit of work to handle.
Ok, enough of such poetic introduction :)
== Growth ==
Over the second half of 2006, traffic and requests to our cluster doubled (actually, that happened in just a few months). Over 2007, traffic and requests to our cluster doubled again.
Pics:
http://www.nedworks.org/~mark/reqstats/trafficstats-yearly.png
http://www.nedworks.org/~mark/reqstats/reqstats-yearly.png
== Hardware expansion ==
Back in September 2006 we had quite a huge load increase, and we went for a capacity expansion, which included:
* 20 new Squid servers ($66k)
* 2 storage servers ($24k)
* 60 application servers ($232k)
The German foundation additionally assisted with purchasing 15 Squid servers in November for the Amsterdam facility.
Later, in January 2007, we added 6 more database servers (for $39k), three additional application servers for auxiliary tasks (such as mail), and some network and datacenter gear.
The growth over autumn/winter led us to a quite big ($240k) capacity expansion back in March, which included:
* 36 very capable 8-core application servers (thank you, Moore, yet again :) - that was around $120k
* 20 Squid servers for the Tampa facility
* A router for the Amsterdam facility
* Additional networking gear (switches, linecards, etc.) for Tampa
The only serious capacity increase afterwards was another 'German' (thanks yet again, Verein) batch of 15 Squid servers for Amsterdam in December 2007.
We do plan to improve on database and storage servers soon - that would add to the stability of our dump building and processing, as well as provide better support for various batch jobs.
We have been especially pushy about exploiting the warranties on all servers, so nearly all machines ever purchased are in working state, doing one or another kind of workload. All the veterans of 2005 are still running at amazing speeds, doing important jobs :) Rob joining to help us with datacenter operations has given us really nice turnarounds on pretty much every piece of datacenter work - volunteer remote hands are no longer unavailable during critical moments. Oh, and look how tidy the cabling is: http://flickr.com/photos/midom/2134991985/ !
== Networking ==
This has been mainly in Mark's and River's capable hands - we underwent the transition from hosting customer to internet service provider (or at least an equal peer to ISPs) ourselves. We have our own independent autonomous systems both in Europe and the US - allowing us to pick the best available connectivity options, resolve routing glitches, and get free traffic peering at internet exchanges. That provides quite a lot of flexibility, of course, at the cost of more work and skills required.
This is also part of the overall well-managed powerful datacenter strategy. Instead of low-efficiency small datacenters scattered around the world, a core facility like the one in Amsterdam provides high availability and close proximity to major Internet hubs and carriers, and is generally in the center of the region's inter-tubes. Though it would be possible to reach out into multiple donated hosting places, that would just lead to slower service for our users, and someone would still have to pay for the bandwidth. As we are pushing nearly 4 Gbps of traffic, there are not many donors who wouldn't feel such traffic.
== Software ==
There has been lots of overall engineering effort, often behind the scenes. Various bits had to be rewritten to behave properly under user activity. The most prominent example of such work is Tim's rewrite of the parser to handle huge template hierarchies more efficiently. In the perfect case, users will not see any visible change, except multiple-factor faster performance on expensive operations.

In the past year, lots of activities - the ways people use customized software: bots, javascript extensions, etc. - have changed the performance profile, and nowadays much of the performance work at the backend is handling various fresh activities - and anomalies. One of the core activities was polishing the caching of our content, so that our application layer could concentrate on the most important process - collaboration - instead of content delivery (a sketch of the purge-on-edit idea follows below).

Lots and lots of small things have been added or fixed - though some developments were quite demanding, like multimedia integration, which was challenging due to our freedom requirements. Still, there was constant tradeoff management, as not every feature was worth the performance sacrifice and costs; on the other hand, having the best possible software for collaboration is also important :)

Introducing new features, or migrating them from outside into the core platform, has always been a serious engineering effort. Besides, there would be quite a lot of communication - explaining how things have to be built so they don't collapse on the live site, discussing security implications, changes in usage patterns, ... Of course, MediaWiki is still one of the most actively developed pieces of web software - and here Brion and Tim lead the volunteers, as well as spend their days and nights in the code.
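To give an idea of what polishing the caching means in practice: pages are cached aggressively at the front-end layer and explicitly invalidated when someone edits, so reads rarely have to touch the application servers. A minimal Python sketch of that purge-on-edit idea - the cache hostnames and page URL are made up, and this illustrates the concept rather than our actual code:

import http.client

# Hypothetical front-end cache hosts - not our real ones.
CACHE_SERVERS = ["cache1.example.org", "cache2.example.org"]

def purge_on_edit(page_path):
    """Best-effort: ask every front-end cache to drop its copy of a page."""
    for host in CACHE_SERVERS:
        conn = http.client.HTTPConnection(host, 80, timeout=2)
        try:
            # Squid can be configured to honour the non-standard PURGE method.
            conn.request("PURGE", page_path)
            conn.getresponse().read()
        except OSError:
            pass  # a cache being down must not block the edit itself
        finally:
            conn.close()

purge_on_edit("/wiki/Main_Page")

The rest of the time the caches can serve pages with very long lifetimes, which is what lets the application layer concentrate on collaboration.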
Across the whole stack, we have worked at every layer - tuning kernels for our high-performance networking, experimenting with database software (some servers are running our own fork of MySQL, based on Google's changes), perfecting Squid, our web caching software (Mark and Tim ended up in the authors list), and digging into problems and peculiarities of the PHP engine. Quite a lot of the problems we hit are very huge-site-specific, and even if other huge shops hit them, we're the ones who are always free to release our changes and fixes. Still, colleagues from other shops are willing to assist us too :)
There were lots of tiny architecture tweaks that allowed us to use resources more efficiently, but none of them were major - pure engineering all the time. It seems that lately we have stabilized lots of things in how Wikipedia works - and it all runs quite fluently. Of course, one must mention Jens' keen eye, taking care of various especially important but easily overlooked things.
River has dedicated lots of attention to supporting the community tools infrastructure at the Toolserver - and also to maintaining off-site copies of the projects.
The site doesn't fall down the very minute nobody is looking at it, and that is quite an improvement over the years :)
== Notes ==
People have been discussing whether running a popular site is really a mission of the WMF. Well, the users created a magnificent resource, we try to support it, we do what we can. Thanks to everyone involved - though it has been a far less stressful ride than in previous years, still, nice work. ;-)
== More reading ==
May hurt your eyes: https://wikitech.leuksman.com/view/Server_admin_log
Platform description: http://dammit.lt/uc/workbook2007.pdf
== Disclaimer ==
Some numbers can be wrong, as this review was based not on an audit, but on vague memories :)
On 1/4/08, Domas Mituzas midom.lists@gmail.com wrote:
It has been a while since my last review of operations (aka hosting report), so I will try to give an overview of some of the things we've been doing =) <snip>
Thanks Domas for the report, and thanks to you and the rest of the team. :-)
On 04/01/2008, Domas Mituzas midom.lists@gmail.com wrote:
Hello colleagues and shareholders (community :)! It has been a while since my last review of operations (aka hosting report), so I will try to give an overview of some of the things we've been doing =)
Submitted to Slashdot, please vote per your feelings on Slashdotting dammit.lt ;-p
http://slashdot.org/firehose.pl?op=view&id=450788
- d.
On Jan 3, 2008 9:07 PM, Domas Mituzas midom.lists@gmail.com wrote:
Hello colleagues and shareholders (community :)!
Hi. Fabulous colleague-shareholder-report you have there; but could you fix the transition speed on your PowerPoint slides? And the font in the footer...
Rob joining to help us with datacenter operations has given us really nice turnarounds on pretty much every piece of datacenter work - volunteer remote hands are no longer unavailable during critical moments.
Is this a general trend -- remote hands not being available during critical moments -- or a chicken-and-egg issue with other elements (such as Rob being around)?
Oh, and look how tidy cabling is: http://flickr.com/photos/midom/2134991985/ !
== Networking ==
This has been mainly in Mark's and River's capable hands - we underwent the transition from hosting customer to internet service provider (or at least an equal peer to ISPs) ourselves. We have our own independent autonomous systems both in Europe and the US - allowing us to pick the best available connectivity options, resolve routing glitches, and get free traffic peering at internet exchanges. That provides quite a lot of flexibility, of course, at the cost of more work and skills required.
This is cool; I had no idea. Is there a longer description of how it works?
Though it would be possible to reach out into multiple donated hosting places, that would just lead to slower service for our users, and someone would still have to pay for the bandwidth. As we are pushing nearly 4 Gbps of traffic, there are not many donors who wouldn't feel such traffic.
Any offers of support from multiple really big donated hosting places?
Of course, MediaWiki is still one of the most actively developed pieces of web software - and here Brion and Tim lead the volunteers, as well as spend their days and nights in the code.
Are there recent stats on the # of reusers, sites, contributors; mediawiki extension variants / repositories outside the main tree?
Quite a lot of the problems we hit are very huge-site-specific, and even if other huge shops hit them, we're the ones who are always free to release our changes and fixes. Still, colleagues from other shops are willing to assist us too :)
Ditto for stats on # and quality of patches from colleagues from other shops. Is there a wall of huge-site-heroes for those who release their patches?
feeling stated, SJ
Hello,
Hi. Fabulous colleague-shareholder-report you have there; but could you fix the transition speed on your PowerPoint slides? And the font in the footer...
Transition speed? PowerPoint slides? That is a workbook, not a presentation :) People read it as a PDF book! :)
Is this a general trend -- remote hands not being available during critical moments -- or a chicken-and-egg issue with other elements (such as Rob being around)?
It gets complicated with all the operations folks being in Europe - hardware provisioning, reinstalls, etc. are usually managed not from the U.S. That limits the time window for various work. Datacenter ops are more fixed in time than anything else - and when we needed things done ASAP, people were at schools/jobs/etc.
Now with Rob around we can be pretty sure that any critical issue will be dealt with swiftly, and non-critical jobs still get done in reasonable time.
This is cool; I had no idea. Is there a longer description of how it works?
Probably Mark could tell much more, but generally equal providers like to exchange their traffic for free - they hop onto traffic exchanges (think of a huge switch; well, in the real world it is a bunch of big switches :) and look for peering partners. Usually it is a bit of trouble for content providers to get peering, but Google has it with nearly every major provider. Smaller players like us have to be really cool. Mark did lots of social work to present us as cool in the networking world, and we are allowed to use some free resources.
The press release about such activities in Amsterdam was at: http://wikimediafoundation.org/wiki/AMS-IX
Any offers of support from multiple really big donated hosting places?
The big problem is that it has to have hardware - lots of it - to support our cached data set, and if you try to disperse it over multiple datacenters, really complex problems start, like how to balance the requests so that they go to the datacenter which has that data (a toy sketch of this follows a bit below). Stuff like 'let's have the French go here, the Germans there' adds lots of administrative work - from maintaining the whole platform to actually troubleshooting.
Generally, the more datacenters there are, the more probably we'll miss problems. This was especially seen when moving Asian languages to Asia - by having platforms that we manage less closely than Tampa, we'd eventually end up with them working slower, with more errors, etc. What was interesting - nobody in Asia would come and tell us that - it seems everyone there is used to bad international sites. Once we reduced the complexity and did just what was nice, easy to manage and efficient, we ended up having people in those countries say 'yay, it is very fast, faster than local sites'.
So, if someone came up with a big donated place that would bring as much caching hardware as we have now in Amsterdam, or in Tampa - it could probably be worth considering. Still, for many of these places, hosting us would be as expensive as just covering our costs. And by using more sites, our costs increase.
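To illustrate the balancing problem mentioned above: once the cached data set is split across sites, every URL effectively lives in one place, and each request has to be steered there. A toy Python sketch (not our code; the datacenter names are invented) of hash-based placement shows how quickly geography and data placement start to disagree:

import hashlib

DATACENTERS = ["tampa", "amsterdam", "seoul"]  # invented pool split

def cache_site_for(url):
    """Pick the datacenter whose cache is responsible for this URL."""
    digest = hashlib.md5(url.encode()).digest()
    return DATACENTERS[int.from_bytes(digest[:4], "big") % len(DATACENTERS)]

# A reader in Paris asking for a German article may still be sent across
# the ocean, because the object happens to hash to Tampa - and the
# geographic alternative ('French go here, Germans there') is exactly the
# administrative burden described above.
print(cache_site_for("/wiki/Berlin"))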
Are there recent stats on the # of reusers, sites, contributors; mediawiki extension variants / repositories outside the main tree?
Well, quite a few people commit extensions to our repositories: http://svn.wikimedia.org/viewvc/mediawiki/trunk/extensions/
The community outside our #mediawiki cabal has not formed, though there are various extensions floating around that are more suitable for other sites than for us. And there are lots of small wikipedias or wikimedias all around the world (I run one at my job; management never called it 'wiki' - it sounded like old, not very nice software - or 'mediawiki').
Ditto for stats on # and quality of patches from colleagues from other shops. Is there a wall of huge-site-heroes for those who release their patches?
I meant more the knowledge sharing with people who run big non-mediawiki sites. Like, folks from Yahoo helped with APC. Google has nice patches for MySQL. SixApart helps with memcached. Even though they don't directly fix our bugs, their engineers are willing to communicate, discuss operations, and sometimes improve the software we use. And in the end they may even buy you a drink. :)
Best regards,
Domas Mituzas wrote: <snip>
Any offers of support from multiple really big donated hosting places?
The big problem is that it has to have hardware - lots of it - to support our cached data set, and if you try to disperse it over multiple datacenters, really complex problems start, like how to balance the requests so that they go to the datacenter which has that data. Stuff like 'let's have the French go here, the Germans there' adds lots of administrative work - from maintaining the whole platform to actually troubleshooting.
What about getting sponsored by Akamai? :o)
Generally, the more datacenters there are, the more probably we'll miss problems. This was especially seen when moving Asian languages to Asia - by having platforms that we manage less closely than Tampa, we'd eventually end up with them working slower, with more errors, etc. <snip>
One problem is that you cannot monitor an application just by using logfiles. You want scripts that check your application servers work as intended, and that trigger warnings when performance drops lower than expected ... Nagios/Ganglia is a first step; you might want more :o)
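For what it's worth, a minimal active check in the Nagios plugin convention (exit 0 = OK, 1 = WARNING, 2 = CRITICAL) could look like the Python sketch below - the server URL and thresholds are invented for illustration:

#!/usr/bin/env python3
import sys
import time
import urllib.request

URL = "http://appserver.example.org/wiki/Main_Page"  # hypothetical
WARN_SECONDS, CRIT_SECONDS = 1.0, 5.0

try:
    start = time.monotonic()
    with urllib.request.urlopen(URL, timeout=CRIT_SECONDS) as response:
        body = response.read()
    elapsed = time.monotonic() - start
except Exception as exc:
    print("CRITICAL: request failed: %s" % exc)
    sys.exit(2)

if not body:
    print("CRITICAL: empty response")
    sys.exit(2)
if elapsed > WARN_SECONDS:
    print("WARNING: response took %.2fs" % elapsed)
    sys.exit(1)
print("OK: response in %.2fs" % elapsed)
sys.exit(0)

Nagios runs such a plugin periodically and alerts on the exit code, which exercises the application end-to-end instead of just tailing logfiles.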
Having clusters in 3 different places is not much different from having 3 clusters in the same room. You still have to manage 3 entities. I must agree it is usually easier to handle stuff at the same place (one guy there, file transfers over 10Gb links, easier to switch a server from one cluster to another).
<snip>
On Jan 3, 2008 8:07 PM, Domas Mituzas midom.lists@gmail.com wrote:
Hello colleagues and shareholders (community :)!
It has been a while since my last review of operations (aka hosting report), so I will try to give an overview of some of the things we've been doing =) <snip>
Truly excellent, as always. Many thanks for the thorough update.
Austin