(note this is wrapped to 78 characters. Sorry if it comes out wonky for you)
The points I'm going to cover in this (long) message:
* Why one message instead of six
* Introduction
* Systems availability
* The need for a better templating system and community agreement on templating systems -- and machine parsing
* Fancruft and how to cope
* How to cope with a message of this length and complexity
1. Why one message instead of six?
Firstly, let me apologize for taking up so much of everyone's time. I was once told that most people will not read emails which exceed two paragraphs. I find that it is mostly true. My hope is that the people who do actually read through to the end of this email (or post, if you're the nntp sort) are the important people, and those who do not are directed to the appropriate place to read the discussion which will inevitably ensue.
Sending six messages would create six (perhaps) lively threads, and nobody has time to read or contribute to all of those threads. I sure don't. Putting everything here, and then augmenting it with the wiki (more on that later), ensures that everyone reads all the issues (they do mostly all interrelate to some degree). My hope is that most of the discussion will actually occur on IRC, and that the important details will be worked out and documented on the wiki.
2. Introduction
My name is Alex Avriette. I've been a systems administrator of one form or another since the mid nineties. Yeah, that means that I'm a little younger than some of the wikipedians (I will be 27 on March 20). However, I am a second generation Unix administrator, and I have been using Solaris since before I was old enough to actually hold a job, and got my first internet email address at age 12. Some of you may be able to dig some of that stuff up on google groups if you're that nosy.
My interest in this project -- mediawiki and wikipedia -- started to take a more serious note when the "great power loss crash" occurred. As a systems admin who has been in charge of high availability systems, it shocked me how long it took to recover. It further shocked me that data had been lost, when I had spent the entire day adding what I felt was useful content.
Honestly, my first contributions were bitching and whining on IRC about "why isn't anyone fixing the thing?" Among others. Not very productive. After hanging out a while, I wound up being on-channel when we blew a power strip, and chaper took notice of me. He and others suggested I mail jwales with my concerns. What proceeded was a long discussion about how to increase the availability of the wikipedia (I realize there are other projects involved, allow me to use the term 'wikipedia' to encompass all related wikimedia projects), and an email from me to jimbo.
I mention this because chaper and Jimbo and I have discussed the availability of the wikipedia, and also the fact that it has been run largely by developers. As somebody who has been both a developer and a systems administrator (clever readers will be able to find my resume online), I can tell you that this is frequently a very bad idea. That is not to say that developers should not have the keys to the kingdom, but frequently, a developer does not know that we need a bigger APC or that we might need APC PDU's, and so on and so forth. I want to say that I'm not just some kook who joined the list to bitch and moan. I've been in conversations with the relevant people, and this email is the first step in the right direction.
3. Systems Availability
So the introduction, while rambling, turns out to be a reasonably good segue into my next real gripe, which is system availability. Some of the outages we have seen have been due to really unacceptable things. The colo losing power is one of them, but not one we can really control too well (more on this in a minute). A power strip blowing is something we most certainly can avoid. We discussed the APC PDU's, I think that's a great step. Have we made any progress on that? We recently also had a "database overload", which we can also avoid. Let me sort of address these individually.
* Power outage at the colo
Kate says we pay for this. This makes it very hard to tolerate failure of that magnitude. Since then, we still don't have ariel back up as the master database server. The solution is multiple colocation centers. The problem with this is the database, which I'll address in a minute.
* Blowing a power strip
APC PDU. Problem solved. I think we're getting there. I mentioned I'm willing to go down and help with datacenter-type stuff if need be (I'm in Virginia). The other thing to consider is to make sure that each server has redundant power supplies, and that each PSU is connected to a separate power source, and that those sources are both diesel- and battery-backed. Simple datacenter stuff. Some of it we can control; the rest of it is up to the datacenter. I've had good luck with Savvis as a datacenter, but I don't know how their rates compare to what we're presently paying.
* Database overload
Well, this, I think, is the biggest, ugliest, most political "problem" in this message. We have *one* master database, and a bunch of slaves. This is fine for failover, if we can then make a slave a master, and edits can continue. This is also fine if we are satisfied with fail-to-readonly as a solution. I, for one, am not. But there may be technical problems ahead which preclude fail-to-promoted-slave-as-master solutions.
Here's the ugly part. In the past, I used PostgreSQL for a database at America Online which received far, far more traffic than the wikipedia. It ran, most recently, on a quad 3 GHz 32-bit Xeon machine. At AOL, I embarked on a 2-month project to figure out how to scale past the "single master database on a single box" problem. Postgres, at version 7.3.2 and 7.4.1 (which we had in production), was not ready. Since then, a product called Slony has appeared ("Slony" is Russian for the plural of "Elephant", the elephant is the postgres mascot, etc), which allows promotion of slaves to masters. That allows the kind of replication that the wikipedia needs.
In order to solve the problem of "a nuke lands on the colo in Florida", we need to have multiple master databases in multiple places. As Kate says, having a redundant nameserver somewhere in Europe doesn't really help if Florida is offline, because they STILL CAN'T GET TO THE CONTENT.
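To make the fail-to-promoted-slave idea concrete, here is a minimal sketch in Python. It is not how MediaWiki, MySQL, or Slony actually does it; the `Replica` class, the `applied_position` field, and both function names are hypothetical and exist only for illustration. The assumption is simple: when the master dies, promote the reachable slave that has replayed the most of the master's log, and fall back to read-only only when no slave can be reached.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Replica:
    # Hypothetical record for one slave database; not a MediaWiki or Slony type.
    name: str
    reachable: bool
    applied_position: int  # how far into the master's log this slave has replayed


def pick_new_master(replicas: List[Replica]) -> Optional[Replica]:
    """Return the reachable slave that has replayed the most, or None."""
    candidates = [r for r in replicas if r.reachable]
    if not candidates:
        return None  # nothing to promote
    return max(candidates, key=lambda r: r.applied_position)


def failover(replicas: List[Replica]) -> str:
    """Describe what the site should do after losing its master."""
    new_master = pick_new_master(replicas)
    if new_master is None:
        return "read-only"  # the fail-to-readonly case
    return f"promote {new_master.name}"


if __name__ == "__main__":
    # Made-up slaves; slave-c is furthest along but unreachable.
    slaves = [
        Replica("slave-a", reachable=True, applied_position=1040),
        Replica("slave-b", reachable=True, applied_position=1055),
        Replica("slave-c", reachable=False, applied_position=1060),
    ]
    print(failover(slaves))
```

The sketch deliberately ignores the hard parts (repointing the remaining slaves, and the edits the dead master accepted but never shipped), which is exactly where the technical problems mentioned above would show up.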
Lastly, Oracle has a product called RAC, their Real Application Clusters. I think that (and no I haven't asked them), they may be willing to *give* us licenses in exchange for being able to use in marketing data "well the wikipedia, which receives x gazillion hits a day uses RAC" and a sound bite from Jimbo... Or whatever. Additionally, since the wikipedia is not a commercial use, it may be legal to just use Oracle for development use, e.g. the entire wikipedia. Oracle is slower than postgres. Oracle is slower than mysql. However, it is more scalable, and it is more robust. If we're able to run it 64-bit on Opteron and go 4- or 8-way with it, and have, say, 2 masters (one in Europe and one in the US), we might score a major win with site failover and reliability. I'm just throwing that option out. I think it's possible, but I also think it's as peril-ridden as the google thing.
And before I forget to mention it, Postgres is *more Free* than mysql. I understand that mediawiki has been coded with mysql in mind, but it might be possible to begin work on a database-agnostic version of the software that actually could plug into postgres and we could test things like cross-continental failover.
Another system reliability subject is the lack of disaster-recovery documentation. Lack of sufficient network diagrams. Lack of documentation required for us (me) to start attacking this from a SYSTEMS point of view. I understand how we work squid, apache, mysql, the slaves, and mediawiki. Cool. But tell me where the switches are. What models they are. Which nodes are connected to which switches. I am happy to contribute all this documentation just by talking to people on IRC (probably Kate, chaper, and jwales). But the fact that it isn't there means that as we continue to grow (and grow we do), this situation gets worse, and it's damn near impossible to get a wikipedia sized wrench around the problem and start tightening stuff down.
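Even a flat, machine-readable inventory would answer those questions. Here is a minimal sketch in Python of what I mean. Every host name, switch name, and model string in it is made up for illustration and is not the real layout. It records which node plugs into which switch, and then lists the roles whose nodes all hang off a single switch.

```python
from collections import defaultdict

# Hypothetical inventory: (node, role, switch). None of these names are real.
LINKS = [
    ("db-master", "database", "switch-1"),
    ("db-slave-1", "database", "switch-1"),
    ("web-1", "apache", "switch-1"),
    ("web-2", "apache", "switch-2"),
    ("cache-1", "squid", "switch-2"),
]

# Hypothetical switch models, the kind of detail the documentation should hold.
SWITCH_MODELS = {"switch-1": "example-model-A", "switch-2": "example-model-B"}


def nodes_by_switch(links):
    """Group node names by the switch they are connected to."""
    grouped = defaultdict(list)
    for node, _role, switch in links:
        grouped[switch].append(node)
    return dict(grouped)


def single_points_of_failure(links):
    """Return {role: switch} for roles whose nodes all hang off one switch."""
    switches_per_role = defaultdict(set)
    for _node, role, switch in links:
        switches_per_role[role].add(switch)
    return {
        role: next(iter(switches))
        for role, switches in switches_per_role.items()
        if len(switches) == 1
    }


if __name__ == "__main__":
    for switch, nodes in sorted(nodes_by_switch(LINKS).items()):
        print(switch, SWITCH_MODELS[switch], nodes)
    print(single_points_of_failure(LINKS))
```

With the made-up data above, losing switch-1 takes out every database node, which is the sort of thing you want to find on paper rather than during an outage.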
4. The need for a better templating system and community agreement on templating systems -- and machine parsing
I've been doing a lot of work on US Military (specifically USMC) weapons systems. To me, this means aircraft, missiles, artillery, and small arms. But as I began to edit them, and add photos, and tables (and wikify things that were html), I found that others were coming along after me to change them to different formats. An example of this would be to use the <nowiki>= header =</nowiki> notation rather than a wiki table with two cells of width, a simple Metric: Value (eg Maximum Operating Ceiling: 60,000ft). The user I was discussing this with is Rlandmann, who is obviously a wikipedian in very good standing. My personal preference is for the wiki table, compared to a header with * Foo: Bar. I find the wiki table to be more machine parseable for starters (I cannot tell you how many times I've had to write a scraper for something).
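To show what I mean by machine parseable, here is a minimal sketch of a scraper for a two-cell-wide wiki table, in Python. It assumes each row sits on one line as `| metric || value`; the function name and the second sample row are made up for illustration, and cells written one per line, cell attributes, and nested markup are not handled.

```python
def parse_two_column_table(wikitext: str) -> dict:
    """Parse a two-cell-wide wiki table into {metric: value}.

    Assumes each row sits on one line as "| metric || value".
    """
    specs = {}
    for raw in wikitext.splitlines():
        line = raw.strip()
        # Skip the table open "{|", table close "|}", row separators "|-",
        # and anything else that is not a cell line.
        if not line.startswith("|") or line.startswith(("|-", "|}")):
            continue
        cells = [cell.strip() for cell in line[1:].split("||")]
        if len(cells) == 2:
            metric, value = cells
            specs[metric] = value
    return specs


# The first row is the example from the text; the second is a placeholder.
SAMPLE = """{|
| Maximum Operating Ceiling || 60,000ft
|-
| Example Metric || example value
|}"""

if __name__ == "__main__":
    print(parse_two_column_table(SAMPLE))
```

The table delimiters tell the scraper where the data starts and stops. With a header followed by * Foo: Bar lines, the scraper has to guess which bullets are data and which are just prose.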
What I'm getting at is we have no "Regularized Data Committee" or something. We have groups of people, such as Rlandmann, who are into aircraft, and want to see the [[CH-21 Shawnee]] look like some of his other aircraft articles. I see that, and I would prefer it look more like the [[GAU-17]]. At any rate, I think we need to have an open discussion on how data should look across the wikipedia, because we have so many articles, and so few of them bear a standard format. I just run into this more than I think others might because I've been editing the weapons systems so much.
I hate committees, and I think forming one is a bad idea. However, I want to open discussion on how to address this problem, because I definitely think if it hasn't already, it deserves to be elevated to "problem" status.
5. Fancruft and how to cope
Those of you who know me on irc (as keats) will know that I am from time to time (well, daily, really) shocked at the amount of fancruft in the wikipedia. Kate makes an excellent counter-argument:
Fancruft, (n), Information about a subject which does not interest me.
I have no real argument against this. I'll be perfectly honest, I don't like the Mighty Morphin Power Rangers. I've spent the last couple days collapsing the fifty or sixty or so articles they had on various "monsters" and "villains" into one giant page (please see my contributions, I don't want to post them here), until some user got quite upset at my characterization of "bulk and skull" as villains. I never watched the show. Having read the article, they seemed like villains, and I placed them there. To say nothing of the fact that the articles I collapsed into the villains list were identical copies of each other and poorly written.
But this isn't really about fancruft. Fancruft, unfortunately, has its place in the pedia. I think probably Darth Vader deserves his own page. But that's also probably because I happen to like Star Wars a little more than MMPR, and since this is a community project, that's not my decision to make.
Let me, however, share an anecdote. My first article, as I recall, to the wikipedia, was [[Treeship]]. I was particularly interested in the treeships from Dan Simmons' Hyperion book. My initial hope had been that somebody else had done the research and explained exactly what they were, and what an erg (from the book, not the term) was, and so on. When I didn't find that, I created the node with the best information I had. I found out a few months later that somebody had gone and essentially concatenated all the individual nodes I had made for characters in the book, like Fedmahn Kassad, and Hyperion (planet), and condensed them all into the rather convenient place of [[Hyperion Cantos]]. And lo, all the information required for understanding the series, all four books (five?), was there, in one page. It was neatly wikified in such a way that I could quite easily find the information I needed, and an appropriate <nowiki>{{spoiler}}</nowiki> warning was added. I feel that we can do that with some of the other sections of the wikipedia which have sort of "blossomed" into their own sub-pedia of television or book-fiction. The two that really worry me (and I've been too afraid to look) are Charmed and Xena -cruft.
There's one more thing to address here, though. Whatever your take on fancruft is, there's an angle that nobody seems to be considering. As a database administrator (oh, did I forget to mention I've been that, too?), I can tell you as tables get bigger (each page having its own tuple or whatever), indices get clobbered, and we slow down the ENTIRE WIKIPEDIA because somebody needed to make a node for Treeship, and Ummon. And so on. Again, I use my own mistakes here, to show that at one point, I didn't have any real understanding of a way to add fancruft which is of cultural value (and in this case, I won't admit that Xena or Charmed have cultural value, but somebody does, and I don't own the wikipedia). I think that what I'm getting at here is that fancruft is not just harmful in the way that it degrades the credibility of the wikipedia (say a user goes to a random node and gets what I got -- [[Needlenose (Power Rangers)]] -- a MMPR monster created from "tommy's cactus that he brought home from arizona"), but it also degrades the overall system performance and availability of the wikipedia as a whole. So maybe we need to take a more serious approach to this Fancruft issue. Maybe it isn't all me bitching "I don't like Xena or Charmed [or television much at all]". Or, somebody could come along and tell me that our database schema is robust and segmented enough that we can support having sixty power rangers articles on minor characters that got blown up after one episode, that we can have sixty xena villain nodes, and so on and so forth. I for one don't buy it, but I might be convince-able.
6. How to cope with a message of this length and complexity.
Probably best to post this to the wikipedia. It's probably got RfC content in it, VfD content in it, and so on. It will initially be a sub-node off my home node (I'm a perlmonks veteran, hence the 'node' vernacular). Please comment there, split out into separate pages as necessary, and so on. NNTP and email aren't sufficient for discussing all these things in parallel.
I will attempt to lightly wikify. Consider this my permission to wikify this node to your hearts' content.
http://en.wikipedia.org/wiki/User:Avriette/The_Big_Email
Except, we're having another meltdown.
Error in numRows(): Duplicate entry '2-Avriette/The_Big_Email' for key 2
Backtrace:
* Database.php line 502 calls wfdebugdiebacktrace()
* Database.php line 600 calls databasemysql::numrows()
* LinkCache.php line 151 calls databasemysql::selectfield()
* Title.php line 980 calls linkcache::addlinkobj()
* Skin.php line 1659 calls title::getarticleid()
* SkinTemplate.php line 174 calls skinmonobook::makeurldetails()
* OutputPage.php line 417 calls skinmonobook::outputpage()
* OutputPage.php line 614 calls outputpage::output()
* Database.php line 360 calls outputpage::databaseerror()
* Database.php line 309 calls databasemysql::reportqueryerror()
* Database.php line 885 calls databasemysql::query()
* Article.php line 871 calls databasemysql::insert()
* EditPage.php line 239 calls article::insertnewarticle()
* EditPage.php line 68 calls editpage::editform()
* EditPage.php line 164 calls editpage::edit()
* index.php line 173 calls editpage::submit()
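For what it's worth, "Duplicate entry ... for key" is the message MySQL gives when an insert collides with a unique index. Here is a minimal sketch of the same kind of collision, using Python's built-in sqlite3 module rather than MySQL. The table `page_demo` and its columns are hypothetical, not MediaWiki's schema, and reading '2-Avriette/The_Big_Email' as a (namespace, title) pair is my assumption.

```python
import sqlite3


def make_demo_db():
    """Create an in-memory table with a unique (namespace, title) index."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE page_demo ("
        " namespace INTEGER NOT NULL,"
        " title TEXT NOT NULL,"
        " UNIQUE (namespace, title))"
    )
    return conn


def insert_page(conn, namespace, title):
    """Try to create a page row; return False if the key already exists."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO page_demo (namespace, title) VALUES (?, ?)",
                (namespace, title),
            )
        return True
    except sqlite3.IntegrityError:
        # The unique index rejected a second insert of the same key.
        return False


if __name__ == "__main__":
    db = make_demo_db()
    print(insert_page(db, 2, "Avriette/The_Big_Email"))  # first save
    print(insert_page(db, 2, "Avriette/The_Big_Email"))  # same key again
```

Catching the collision and treating it as "the page already exists" is one way to avoid dumping a backtrace on the user; whether that fits the real code path is a separate question.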
# aja // vim:tw=78:ts=2:et
Alex J. Avriette wrote:
My interest in this project -- mediawiki and wikipedia -- started to take a more serious note when the "great power loss crash" occurred. As a systems admin who has been in charge of high availability systems, it shocked me how long it took to recover. It further shocked me that data had been lost, when I had spent the entire day adding what I felt was useful content.
What data was lost? As far as I know, nothing at all should have been lost in that incident except perhaps from the last few seconds prior to the crash. If you looked at the site while it was read-only during the playback of logs or shortly thereafter before we cleared the parser cache you might have seen old versions of pages, but that would have all been restored by the end of the day.
I mention this because chaper and Jimbo and I have discussed the availability of the wikipedia, and also the fact that it has been run largely by developers. As somebody who has been both a developer and a systems administrator (clever readers will be able to find my resume online), I can tell you that this is frequently a very bad idea. That is not to say that developers should not have the keys to the kingdom, but frequently, a developer does not know that we need a bigger APC or that we might need APC PDU's, and so on and so forth.
Depends on what you mean by "developers". Many of the "developers" (such as Kate and Jamesday) aren't actually the people touching the MediaWiki code, but are system administrators and DBAs who spend most of their time and effort on running the server farm, arranging the network, ordering our new hardware, database admin, etc.
- Power outage at the colo Kate says we pay for this. This makes it very hard to tolerate failure of that magnitude. Since then, we still don't have ariel back up as the master database server. The solution is multiple colocation centers.
Well, additional data centers are on their way. :)
Lastly, Oracle has a product called RAC, their Real Application Clusters. I think that (and no I haven't asked them), they may be willing to *give* us licenses in exchange for being able to use in marketing data "well the wikipedia, which receives x gazillion hits a day uses RAC" and a sound bite from Jimbo...
Oracle is unlikely to happen, even if they pay us to use it. There's a conscious political decision to use FOSS software.
And before I forget to mention it, Postgres is *more Free* than mysql. I understand that mediawiki has been coded with mysql in mind, but it might be possible to begin work on a database-agnostic version of the software that actually could plug into postgres and we could test things like cross-continental failover.
Experimental PostgreSQL support already exists, and will be improving as time goes along.
Another system reliability subject is the lack of disaster-recovery documentation. Lack of sufficient network diagrams. Lack of documentation required for us (me) to start attacking this from a SYSTEMS point of view. I understand how we work squid, apache, mysql, the slaves, and mediawiki. Cool. But tell me where the switches are. What models they are. Which nodes are connected to which switches.
Some of this is on wp.wikidev.net. If it's not, talk to Kate etc and make sure it gets done.
-- brion vibber (brion @ pobox.com)
wikitech-l@lists.wikimedia.org