This hasn't been done for a while, so I'll try to sum up changes in our operations since November, 2005.
There has been much less insane headless-chicken running, and we've seen quite steady operation lately (except for a few hiccups). First of all, we could afford to order hardware before we were completely overloaded - a constant struggle in previous years. There have also been lots of system architecture changes lately - the way we store data, and the way we serve and cache images and text.
==Hardware==
One piece of good news is that we can still stay with the same class of database servers, which are even getting much cheaper than before. Database server cost per unit went from $15000 in June, 2005 to $12500 in October, 2005, to $9070 in March, 2006. We got four of these servers in March and called them... db1, db2, db3 and db4.
For the application environment we made a single $100000 purchase, which provided us with 40 high-performance servers (with two dual-core Opteron processors and 4GB of RAM each). This nearly doubled our CPU capacity, and also provided enough space for revision storage, in-memory caching, etc.
For our current caching-layer expansion we ordered 20 high-performance servers (8GB of memory, four fast disks, $3300 each), which should appear in production in about a month. We're investigating possibilities for adding more hardware to the Amsterdam cluster; we might end up with 10 additional cache servers there too.
We also purchased $40000 worth of Foundry hardware, based on their BigIron RX-8 platform. We will use it as our highly available core routing layer, as well as for connectivity to the most demanding servers. It will also allow flexible networking with upstream providers.
Our next purchase will be image hosting/archival systems; we are still investigating whether to use our previous approach (a big cheap server with lots of big cheap disks) or to deploy a storage appliance.
We reallocated some aging servers to the search cluster and other auxiliary roles, and continue this practice, so that we end up with a more homogeneous application environment.
==Software==
There have been lots of improvements in MediaWiki itself, but additionally Tim and Mark ended up on the Squid authors list - the changes made to its code were critical to proper squid performance. We split the database cluster, with the English Wikipedia ending up on a separate set of boxes. Some of the old database servers got a new life as slaves for just a few languages, which compensates for their lack of memory or fast disks. Additionally, revision storage was moved from our core database boxes to 'external storage clusters' - our application servers, utilizing their idle disks.
Optimization work targets multiple factors: "make it faster" means not only serving more requests per second, but also reducing response times, and both are worked on constantly.
And of course, as always, the team has been marvelous ;-) Thanks!
Domas Mituzas wrote:
> This hasn't been done for a while, so I'll try to sum up changes in our operations since November, 2005. [...] And of course, as always, team has been marvelous ;-) Thanks!
Thanks to the team for all the work and for your summary.
Reading about so much new hardware and supposedly more free servers, what about re-enabling access statistics? The problem of also counting requests served by the squids still holds when using pure access logs.
But wouldn't an approach like the one from [[de:Benutzer:LeonWeber/WikiCharts]] be the solution? There, a short JS snippet sends a small request to the toolserver, which logs it. In theory every request is logged, but because of limitations on the toolserver only every 600th request actually is.
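As an aside, the sampling idea described above is easy to sketch. Leon's actual tool is a JS snippet running in readers' browsers; purely as an illustration (names and rate handling are made up, not his code), the logging decision could look like this:

```python
import random

SAMPLE_RATE = 600  # log roughly one out of every 600 page views


def maybe_log(page_title, log):
    """Decide randomly whether this page view gets logged.

    Each view is logged with probability 1/SAMPLE_RATE, so the
    logger only ever sees ~1/600th of the traffic; multiplying its
    counts by SAMPLE_RATE gives an estimate of the real totals.
    Returns True if the view was logged.
    """
    if random.randrange(SAMPLE_RATE) == 0:
        log(page_title)
        return True
    return False
```

The estimate is only statistical, which is also why such counters are easy to inflate: a client that deliberately fires many requests shifts the sampled counts just as real traffic does.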
With two or three dedicated servers serving a one-pixel image and logging its requests, wouldn't it be possible to re-enable reliable article view statistics?
Regards, Jürgen
On Sat, Aug 19, 2006 at 12:08:33PM +0200, Jürgen Herz wrote:
> Domas Mituzas wrote:
>> This hasn't been done for a while, so I'll try to sum up changes in our operations since November, 2005. [...] And of course, as always, team has been marvelous ;-) Thanks!
> Thanks to the team for all the work and for your summary.
> Reading about so much new hardware and supposedly more free servers, what about re-enabling access statistics? The problem of also counting requests served by the squids still holds when using pure access logs.
> But wouldn't an approach like the one from [[de:Benutzer:LeonWeber/WikiCharts]] be the solution? There, a short JS snippet sends a small request to the toolserver, which logs it. In theory every request is logged, but because of limitations on the toolserver only every 600th request actually is.
Is LeonWeber's talk page still the second most visited page? The problem with these JS tools is that they are easily faked if they use sampling (e.g. logging only 1 in 600 requests).
The squid developers are working on code - very likely to be included in the next patch set for the stable squid release - that allows specifying a remote loghost, apparently using cheap UDP datagrams to send the log entries.
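A UDP loghost of the kind described is simple to sketch. The following is an illustration only - the port and function names are invented, and this is not the actual squid patch - of a receiver that collects log lines sent as datagrams:

```python
import socket

# Hypothetical address for the loghost; the real deployment would
# pick its own interface and port.
LOG_ADDR = ("0.0.0.0", 46514)


def run_loghost(handle, max_lines=None):
    """Receive log-line datagrams and pass each one to handle().

    UDP is "cheap" in the sense the mail describes: there is no
    connection state to maintain, and a dropped datagram simply
    means a lost log line rather than a stalled squid.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(LOG_ADDR)
    seen = 0
    while max_lines is None or seen < max_lines:
        data, _addr = sock.recvfrom(65535)  # one datagram = one log line
        handle(data.decode("utf-8", errors="replace"))
        seen += 1
    sock.close()
```

Unlike pixel-based sampling, such a loghost sees every request the squids serve, so the resulting statistics need no extrapolation.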
> With two or three dedicated servers serving a one-pixel image and logging its requests, wouldn't it be possible to re-enable reliable article view statistics?
Not with pixel sampling, but with remote logging squids, yes. This is being worked on. Give us some time, and enjoy Leon's statistics in the meantime.
Regards,
jens
wikitech-l@lists.wikimedia.org