[Foundation-l] State of technology: 2007
Domas Mituzas
midom.lists at gmail.com
Fri Jan 4 02:07:04 UTC 2008
Hello colleagues and shareholders (community :)!
It has been a while since my last review of operations (aka hosting
report) - so I will try to give an overview of some of the things we've
been doing =)
First of all, I'd like to thank Mr. Moore for his fabulous law. It
allowed Wikipedia to stay alive - even though we had to grow again in
all directions.
We still have Septembers. It is a nice name for the recurring pattern
that provides Shock and Awe to us - after a period of stable usage,
every autumn the number of users suddenly goes up and stays there,
letting us think we've finally reached some saturation point and will
never grow any more. Until next September.
We still have World Events. People rush to us to read about conflicts
and tragedies, joys and celebrations. Sometimes because we had
information for ages, sometimes because it all matured in seconds or
minutes. Nowhere else can a document require that much concurrent
collaboration, and nowhere else can it provide as much value
immediately.
We still have history. From day one of the project, we can see people
going into dramas, discussing, evolving and revolving every idea on
the site. Every edit stays there - accumulating not only the final
pieces of information, but the whole process of assembling the content.
We still advance. Tools to facilitate the community get more complex,
and we are growing an ecosystem of tools and processes inside and
outside the core software and platform. Users are the actual developers
of the project; the core technology just lags behind, assisting.
Our operation becomes more and more demanding - and that's quite a bit
of work to handle.
Ok, enough of such poetic introduction :)
== Growth ==
Over the second half of 2006, traffic and requests to our cluster
doubled (actually, that happened in just a few months).
Over 2007, traffic and requests to our cluster doubled again.
Pics:
http://www.nedworks.org/~mark/reqstats/trafficstats-yearly.png
http://www.nedworks.org/~mark/reqstats/reqstats-yearly.png
== Hardware expansion ==
Back in September 2006 we had quite a huge load increase, and we went
for a capacity expansion, which included:
* 20 new Squid servers ($66k)
* 2 storage servers ($24k)
* 60 application servers ($232k)
The German foundation additionally assisted with purchasing 15 Squid
servers in November for the Amsterdam facility.
Later in January 2007 we added 6 more database servers (for $39k),
three additional application servers for auxiliary tasks (such as
mail), and some network and datacenter gear.
The growth over autumn/winter led us to a quite big ($240k) capacity
expansion back in March, which included:
* 36 very capable 8-core application servers (thank you Moore yet
again :) - that was around $120k
* 20 Squid servers for Tampa facility
* Router for Amsterdam facility
* Additional networking gear (switches, linecards, etc) for Tampa
The only serious capacity increase afterwards was another
'German' (thanks yet again, Verein) batch of 15 Squid servers for
Amsterdam in December 2007.
We do plan to improve on database and storage servers soon - that
would add to the stability of our dump building and processing, as
well as give better support for various batch jobs.
We have been especially pushy about exploiting warranties on all
servers, and nearly all machines ever purchased are in a working state,
handling one kind of workload or another. All the veterans of 2005 are
still running at amazing speeds, doing the important jobs :)
Rob joining to help us with datacenter operations has allowed us to
have really nice turnarounds on pretty much every piece of datacenter
work - as volunteer remote hands were no longer available during
critical moments. Oh, and look how tidy the cabling is:
http://flickr.com/photos/midom/2134991985/ !
== Networking ==
This has been mainly in Mark's and River's capable hands - we
underwent the transition from hosting customer to internet service
provider (or at least an equal peer to ISPs) ourselves. We have our
own independent autonomous systems both in Europe and the US -
allowing us to pick the best available connectivity options, resolve
routing glitches, and get free traffic peering at internet exchanges.
That provides quite a lot of flexibility, of course, at the cost of
more work and skills required.
This is also part of an overall strategy of a few well-managed,
powerful datacenters. Instead of low-efficiency small datacenters
scattered around the world, a core facility like the one in Amsterdam
provides high availability and close proximity to major Internet hubs
and carriers, and sits generally at the center of the region's
inter-tubes. Though it would be possible to reach out into multiple
donated hosting places, that would just lead to slower service for our
users, and someone would still have to pay for the bandwidth. As we
are pushing nearly 4 Gbps of traffic, there are not many donors who
wouldn't feel such traffic.
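To give a feeling for that scale, here is a back-of-the-envelope
calculation (my rounding, not billing data):

# rough scale of ~4 Gbps sustained traffic (back-of-the-envelope only)
gbps = 4
bytes_per_second = gbps * 1e9 / 8               # ~500 MB every second
tb_per_day = bytes_per_second * 86400 / 1e12    # ~43 TB per day
pb_per_month = tb_per_day * 30 / 1000           # ~1.3 PB per month
print("%.0f TB/day, %.1f PB/month" % (tb_per_day, pb_per_month))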
== Software ==
There has been lots of overall engineering effort, often behind the
scenes. Various bits had to be rewritten to behave properly under user
activity. The most prominent example of such work is Tim's rewrite of
the parser to handle huge template hierarchies more efficiently. In
the perfect case, users will not see any visible change, except
multiple-factor faster performance on expensive operations.
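To give a rough feel for where such a speedup can come from, here is a
toy sketch (Python, nothing like the real parser code, simplified
syntax with no nested braces): when templates include other templates
over and over, expanding each one once and reusing the result beats
naively re-expanding the whole subtree at every use.

# Toy model only - not MediaWiki's preprocessor. t0 includes t1 twice,
# t1 includes t2 twice, and so on; naive expansion would recurse
# exponentially, the memoized version computes each template just once.
from functools import lru_cache

TEMPLATES = {"t%d" % i: "x {{t%d}} {{t%d}}" % (i + 1, i + 1) for i in range(20)}
TEMPLATES["t20"] = "leaf"

@lru_cache(maxsize=None)
def expand(name):
    text = TEMPLATES[name]
    while "{{" in text:                  # toy syntax: no nested braces
        start = text.index("{{")
        end = text.index("}}", start)
        text = text[:start] + expand(text[start + 2:end]) + text[end + 2:]
    return text

print(len(expand("t0")))                 # each template expanded exactly once

The real preprocessor does far more (arguments, parser functions,
limits) - this is only to show why huge template hierarchies used to
be so expensive.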
In the past year, lots of activities - the ways people use customized
software: bots, JavaScript extensions, etc. - have changed the
performance profile, and nowadays a lot of the performance work on the
backend goes into handling various fresh activities - and anomalies.
One of the core activities was polishing the caching of our content,
so that our application layer could concentrate on the most important
process - collaboration - instead of content delivery.
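The general idea, very much simplified (hypothetical host names below,
not our actual configuration): anonymous readers get pages straight
from the Squid caches, and on every edit the application only has to
tell the caches that a page changed, e.g. with an HTTP PURGE request:

# simplified purge-on-edit sketch (made-up hosts, not our real setup)
import http.client

CACHE_HOSTS = ["sq1.example.org", "sq2.example.org"]   # frontend caches

def purge_page(path, site="en.wikipedia.org"):
    # ask every cache to drop its copy of the given page
    for host in CACHE_HOSTS:
        conn = http.client.HTTPConnection(host, 80, timeout=2)
        try:
            conn.request("PURGE", path, headers={"Host": site})
            conn.getresponse()
        finally:
            conn.close()

purge_page("/wiki/Main_Page")   # called after a successful page save

The real thing is more involved (batching, different invalidation
paths), but the principle stays the same: cache aggressively, and
invalidate on change.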
Lots and lots of small things have been added or fixed - though some
developments were quite demanding, like multimedia integration, which
was challenging due to our freedom requirements.
Still, there was constant tradeoff management, as not every feature
was worth the performance sacrifice and costs - and on the other hand,
having the best possible software for collaboration is also
important :) Introducing new features, or migrating them from outside
into the core platform, has always been a serious engineering effort.
Besides, there is quite a lot of communication involved - explaining
how things have to be built so they don't collapse on the live site,
discussing security implications, changes of usage patterns, ...
Of course, MediaWiki is still one of the most actively developed
pieces of web software - and here Brion and Tim lead the volunteers,
as well as spend their days and nights in the code.
Across the whole stack, we have worked at every layer - tuning kernels
for our high-performance networking, experimenting with database
software (some servers are running our own fork of MySQL, based on the
Google changes), perfecting Squid, our web caching software (Mark and
Tim ended up in the authors list), and digging into the problems and
peculiarities of the PHP engine. Quite a lot of the problems we hit
are very huge-site-specific, and even if other huge shops hit them,
we're the ones who are always free to release our changes and fixes.
Still, colleagues from other shops are willing to assist us too :)
There were lots of tiny architecture tweaks that allowed us to use
resources more efficiently, but none of them were major - pure
engineering all the time. It seems that lately we have stabilized lots
of things in how Wikipedia works - and it seems to work quite
fluently. Of course, one must mention Jens' keen eye, taking care of
various especially important but easily overlooked things.
River has dedicated lots of attention to supporting the community
tools infrastructure at the Toolserver - and also to maintaining
off-site copies of the projects.
The site doesn't fall down the very minute nobody is looking at it,
and that is quite an improvement over the years :)
== Notes ==
People have been discussing whether running a popular site is really
the mission of the WMF. Well, users created a magnificent resource, we
try to support it, and we do what we can. Thanks to everyone involved -
though it has been a far less stressful ride than in previous years,
still, nice work. ;-)
== More reading ==
May hurt your eyes: https://wikitech.leuksman.com/view/Server_admin_log
Platform description: http://dammit.lt/uc/workbook2007.pdf
== Disclaimer ==
Some numbers may be wrong, as this review was based not on an audit,
but on vague memories :)
--
Domas Mituzas -- http://dammit.lt/ -- [[user:midom]]