Alex J. Avriette wrote:
My interest in this project -- MediaWiki and Wikipedia -- started to take a more serious turn when the "great power loss crash" occurred. As a systems admin who has been in charge of high availability systems, it shocked me how long it took to recover. It further shocked me that data had been lost, when I had spent the entire day adding what I felt was useful content.
What data was lost? As far as I know, nothing at all should have been lost in that incident except perhaps from the last few seconds prior to the crash. If you looked at the site while it was read-only during the playback of logs, or shortly thereafter before we cleared the parser cache, you might have seen old versions of pages, but that would all have been restored by the end of the day.
I mention this because chaper and Jimbo and I have discussed the availability of the wikipedia, and also the fact that it has been run largely by developers. As somebody who has been both a developer and a systems administrator (clever readers will be able to find my resume online), I can tell you that this is frequently a very bad idea. That is not to say that developers should not have the keys to the kingdom, but frequently a developer does not know that we need a bigger APC or that we might need APC PDUs, and so on and so forth.
Depends on what you mean by "developers". Many of the "developers" (such as Kate and Jamesday) aren't actually the people touching the MediaWiki code, but are system administrators and DBAs who spend most of their time and effort on running the server farm, arranging the network, ordering our new hardware, database admin, etc.
- Power outage at the colo: Kate says we pay for this. This makes it very hard to tolerate failure of that magnitude. Since then, we still don't have ariel back up as the master database server. The solution is multiple colocation centers.
Well, additional data centers are on their way. :)
Lastly, Oracle has a product called RAC, their Real Application Clusters. I think (and no, I haven't asked them) that they may be willing to *give* us licenses in exchange for being able to use in marketing data "well, the wikipedia, which receives x gazillion hits a day, uses RAC" and a sound bite from Jimbo...
Oracle is unlikely to happen, even if they pay us to use it. There's a conscious political decision to use FOSS software.
And before I forget to mention it, Postgres is *more Free* than MySQL. I understand that MediaWiki has been coded with MySQL in mind, but it might be possible to begin work on a database-agnostic version of the software that actually could plug into Postgres, and we could test things like cross-continental failover.
Experimental PostgreSQL support already exists, and will be improving as time goes along.
Another system reliability subject is the lack of disaster-recovery documentation. Lack of sufficient network diagrams. Lack of documentation required for us (me) to start attacking this from a SYSTEMS point of view. I understand how we work squid, apache, mysql, the slaves, and mediawiki. Cool. But tell me where the switches are. What models they are. Which nodes are connected to which switches.
Some of this is on wp.wikidev.net. If it's not, talk to Kate etc. and make sure it gets done.
-- brion vibber (brion @ pobox.com)