(note this is wrapped to 78 characters. Sorry if it comes out wonky for you)
The points I'm going to cover in this (long) message:
* Why one message instead of six
* Introduction
* Systems availability
* The need for a better templating system and community agreement on
templating systems -- and machine parsing
* Fancruft and how to cope
* How to cope with a message of this length and complexity.
1. Why one message instead of six?
Firstly, let me apologize for taking up so much of everyone's time.
I was once told that most people will not read emails which exceed two
paragraphs.
I find that it is mostly true. My hope is that the people who do actually read
through to the end of this email (or post, if you're the nntp sort) are the
important people, and those who do not are directed to the appropriate place to
read the discussion which will inevitably ensue.
Sending six messages would create six (perhaps) lively threads, and nobody
has time to read or contribute to all of those threads. I sure don't.
Putting everything here, and then augmenting it with the wiki (more on that
later), ensures that everyone reads all the issues (they mostly all
interrelate to some degree). My hope is that most of the discussion will
actually occur on IRC, and that the important details will be worked out and
documented on the wiki.
2. Introduction
My name is Alex Avriette. I've been a systems administrator of one form
or another since the mid-nineties. Yeah, that means that I'm a little younger
than some of the wikipedians (I will be 27 on March 20). However, I am a second
generation Unix administrator, and I have been using Solaris since before I was
old enough to actually hold a job, and got my first internet email address at
age 12. Some of you may be able to dig some of that stuff up on google groups
if you're that nosy.
My interest in this project -- mediawiki and wikipedia -- started to take a
more serious turn when the "great power loss crash" occurred. As a systems
admin who has been in charge of high-availability systems, it shocked me how
long it took to recover. It further shocked me that data had been lost, when I
had spent the entire day adding what I felt was useful content.
Honestly, my first contributions were bitching and whining on IRC about "why
isn't anyone fixing the thing?", among other things. Not very productive.
After hanging out a while, I wound up being on-channel when we blew a power
strip, and chaper took notice of me. He and others suggested I mail jwales
with my concerns. What followed was a long discussion about how to increase
the availability of the wikipedia (I realize there are other projects
involved; allow me to use the term 'wikipedia' to encompass all related
wikimedia projects), and an email from me to jimbo.
I mention this because chaper and Jimbo and I have discussed the availability
of the wikipedia, and also the fact that it has been run largely by developers.
As somebody who has been both a developer and a systems administrator (clever
readers will be able to find my resume online), I can tell you that this is
frequently a very bad idea. That is not to say that developers should not have
the keys to the kingdom, but frequently, a developer does not know that we need
a bigger APC or that we might need APC PDUs, and so on and so forth. I want to
say that I'm not just some kook who joined the list to bitch and moan. I've been
in conversations with the relevant people, and this email is the first step in
the right direction.
3. Systems Availability
So the introduction, while rambling, turns out to be a reasonably good segue
into my next real gripe, which is system availability. Some of the outages we
have seen have been due to really unacceptable things. The colo losing power is
one of them, but not one we can really control too well (more on this in a
minute). A power strip blowing is something we most certainly can avoid. We
discussed the APC PDUs; I think that's a great step. Have we made any progress
on that? We recently also had a "database overload", which we can also avoid.
Let me sort of address these individually.
* Power outage at the colo
Kate says we pay for this. That makes it very hard to tolerate a failure of
that magnitude. Since then, we still don't have ariel back up as the master
database server. The solution is multiple colocation centers. The problem with
this is the database, which I'll address in a minute.
* Blowing a power strip
APC PDU. Problem solved. I think we're getting there. I mentioned I'm willing
to go down and help with datacenter-type stuff if need be (I'm in Virginia). The
other thing to consider is to make sure that each server has redundant power
supplies, and that each PSU is connected to a separate power source, and that
those sources are both diesel and battery-backed up. Simple datacenter stuff.
Some of it, we can control, the rest of it is up to the datacenter. I've had
good luck with Savvis as a datacenter, but I don't know how their rates compare
to what we're presently paying.
* Database overload
Well, this, I think, is the biggest, ugliest, most political "problem" in this
message. We have *one* master database, and a bunch of slaves. This is fine
for failover, if we can then make a slave a master, and edits can continue.
This is also fine if we are satisfied with fail-to-readonly as a solution. I,
for one, am not. But there may be technical problems ahead which preclude
fail-to-promoted-slave-as-master solutions. Here's the ugly part. In the past,
I used PostgreSQL for a database at America Online which received far, far
more traffic than the wikipedia. It ran, most recently, on a quad 3GHz 32-bit
Xeon machine. At AOL, I embarked on a two-month project to figure out how to
scale past the "single master database on a single box" problem. Postgres, at
versions 7.3.2 and 7.4.1 (which we had in production), was not ready. Since
then, a product called Slony has appeared ("slony" is Russian for "elephants";
the elephant is the postgres mascot, etc.), which allows promotion of slaves
to masters -- and thus the kind of replication that the wikipedia needs. In
order to solve the problem of "a nuke lands on the colo in Florida", we need
to have multiple master databases in multiple places. As Kate says, having a
redundant nameserver somewhere in Europe doesn't really help if Florida is
offline, because they STILL CAN'T GET TO THE CONTENT.
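To make the fail-to-promoted-slave idea concrete, here is a toy sketch of the
decision logic only -- this is NOT Slony's actual interface, and the server
names, data shape, and lag threshold are all my own invention for
illustration:

```python
# Toy sketch of "promote the healthiest slave when the master dies".
# Hypothetical data and threshold; not Slony's (or anyone's) real API.

MAX_ACCEPTABLE_LAG = 30  # seconds of replication lag we will tolerate

def choose_new_master(slaves):
    """Pick the reachable slave with the least replication lag,
    provided it is under our threshold; otherwise give up and
    fail to read-only."""
    candidates = [s for s in slaves
                  if s["reachable"] and s["lag_seconds"] <= MAX_ACCEPTABLE_LAG]
    if not candidates:
        return None  # nothing safe to promote: fail-to-readonly instead
    return min(candidates, key=lambda s: s["lag_seconds"])["name"]

# invented example fleet
slaves = [
    {"name": "suda",  "reachable": True,  "lag_seconds": 4},
    {"name": "ariel", "reachable": False, "lag_seconds": 0},
    {"name": "bacon", "reachable": True,  "lag_seconds": 90},
]
print(choose_new_master(slaves))  # -> suda
```

The hard part, of course, is not choosing the slave but fencing the old
master and repointing the application, which is exactly what tools like Slony
exist to handle.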
Lastly, Oracle has a product called RAC, their Real Application Clusters. I
think that (and no, I haven't asked them) they may be willing to *give* us
licenses in exchange for being able to say in their marketing "well, the
wikipedia, which receives x gazillion hits a day, uses RAC", plus a sound bite
from Jimbo... or whatever. Additionally, since the wikipedia is not a
commercial use, it may be legal to just run the entire wikipedia on Oracle
under a development-use license. Oracle is slower than postgres. Oracle is
slower than mysql. However, it is more scalable, and it is more robust. If
we're able to run it 64-bit on Opteron and go 4- or 8-way with it, and have,
say, two masters (one in Europe and one in the US), we might score a major win
with site failover and reliability. I'm just throwing that option out. I think
it's possible, but I also think it's as peril-ridden as the google thing.
And before I forget to mention it, Postgres is *more Free* than mysql. I
understand that mediawiki has been coded with mysql in mind, but it might be
possible to begin work on a database-agnostic version of the software that
could actually plug into postgres, and then we could test things like
cross-continental failover.
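A database-agnostic layer is not exotic. As a minimal sketch (my own
illustration, not mediawiki code -- the class and table names are invented,
and sqlite stands in for mysql or postgres):

```python
# Minimal sketch of a database-agnostic query layer. The idea: the
# application talks to one interface, and the backend is chosen in one
# place by configuration. sqlite3 stands in for a real backend here.
import sqlite3

class Database:
    """Thin wrapper around a DB-API connection; swapping the connect
    function changes backends without touching calling code."""
    def __init__(self, connect, *args, **kwargs):
        self.conn = connect(*args, **kwargs)

    def query(self, sql, params=()):
        cur = self.conn.cursor()
        cur.execute(sql, params)  # parameterized, not string-pasted
        return cur.fetchall()

    def execute(self, sql, params=()):
        cur = self.conn.cursor()
        cur.execute(sql, params)
        self.conn.commit()

# backend chosen in one place; a postgres or mysql driver's connect
# function would slot in the same way
db = Database(sqlite3.connect, ":memory:")
db.execute("CREATE TABLE page (title TEXT, text TEXT)")
db.execute("INSERT INTO page VALUES (?, ?)", ("Treeship", "stub"))
print(db.query("SELECT text FROM page WHERE title = ?", ("Treeship",)))
# -> [('stub',)]
```

A real layer also has to paper over the drivers' differing placeholder
styles and SQL dialects, which is where most of the porting work actually
lives.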
Another system reliability subject is the lack of disaster-recovery
documentation. Lack of sufficient network diagrams. Lack of the documentation
required for us (me) to start attacking this from a SYSTEMS point of view. I
understand how we work squid, apache, mysql, the slaves, and mediawiki. Cool.
But tell me where the switches are. What models they are. Which nodes are
connected to which switches. I am happy to contribute all this documentation
just by talking to people on IRC (probably Kate, chaper, and jwales). But the
fact that it isn't there means that as we continue to grow (and grow we do),
this situation gets worse, and it's damn near impossible to get a
wikipedia-sized wrench around the problem and start tightening stuff down.
4. The need for a better templating system and community agreement on
templating systems -- and machine parsing
I've been doing a lot of work on US Military (specifically USMC) weapons
systems. To me, this means aircraft, missiles, artillery, and small arms. But
as I began to edit them, and add photos and tables (and wikify things that
were html), I found that others were coming along after me to change them to
different formats. An example of this would be to use the <nowiki>= header =</nowiki>
notation rather than a two-column wiki table of simple Metric: Value pairs
(eg Maximum Operating Ceiling: 60,000 ft). The user I was discussing this with
is Rlandmann, who is obviously a wikipedian in very good standing. My personal
preference is for the wiki table, compared to a header with * Foo: Bar. I find
the wiki table to be more machine-parseable, for starters (I cannot tell you
how many times I've had to write a scraper for something).
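To make the machine-parseability point concrete, here is the sort of
throwaway scraper I mean. This handles only a simplified two-column
"| metric || value" row form, not full wikitext; the example article text is
invented:

```python
# Throwaway sketch: pull Metric -> Value pairs out of a simplified
# two-column wiki table. Real wikitext has many more row forms; this
# only handles "| metric || value" lines, for illustration.
def parse_wiki_table(text):
    specs = {}
    for line in text.splitlines():
        line = line.strip()
        # data rows start with "|" and separate cells with "||"
        if line.startswith("|") and "||" in line:
            metric, value = line.lstrip("|").split("||", 1)
            specs[metric.strip()] = value.strip()
    return specs

article = """{|
| Maximum Operating Ceiling || 60,000 ft
| Crew || 2
|}"""
print(parse_wiki_table(article))
# -> {'Maximum Operating Ceiling': '60,000 ft', 'Crew': '2'}
```

Try writing the equivalent for freeform "= header =" sections with * Foo: Bar
bullets underneath; it can be done, but every article lays it out slightly
differently, which is exactly the problem.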
What I'm getting at is that we have no "Regularized Data Committee" or
anything of the sort.
We have groups of people, such as Rlandmann, who are into aircraft, and want to
see the [[CH-21 Shawnee]] look like some of his other aircraft articles. I see
that, and I would prefer it look more like the [[GAU-17]]. At any rate, I think
we need to have an open discussion on how data should look across the wikipedia,
because we have so many articles, and so few of them bear a standard format. I
just run into this more than I think others might because I've been editing the
weapons systems so much.
I hate committees, and I think a committee here would be a bad idea. However,
I want to open discussion on how to address this problem, because I definitely
think it deserves to be elevated to "problem" status, if it hasn't been
already.
5. Fancruft and how to cope
Those of you who know me on irc (as keats) will know that I am from time to
time (well, daily, really) shocked at the amount of fancruft in the wikipedia.
Kate makes the excellent argument against my argument:
  Fancruft, (n): Information about a subject which does not interest me.
I have no real argument against this. I'll be perfectly honest: I don't like
the Mighty Morphin Power Rangers. I've spent the last couple days collapsing
the fifty or sixty or so articles they had on various "monsters" and
"villains" into one giant page (please see my contributions, I don't want to
post them here), until some user got quite upset at my characterization of
Bulk and Skull as villains. I never watched the show. Having read the article,
they seemed like villains, and I placed them there. To say nothing of the fact
that the articles I collapsed into the villains list were identical copies of
each other and poorly written.
But this isn't really about fancruft. Fancruft, unfortunately, has its place
in the pedia. I think probably Darth Vader deserves his own page. But that's
also probably because I happen to like Star Wars a little more than MMPR, and
since this is a community project, that's not my decision to make.
Let me, however, share an anecdote. My first article to the wikipedia, as I
recall, was [[Treeship]]. I was particularly interested in the treeships from
Dan Simmons' Hyperion books. My initial hope had been that somebody else had
done the research and explained exactly what they were, and what an erg (from
the book, not the unit) was, and so on. When I didn't find that, I created the
node with the best information I had. I found out a few months later that
somebody had gone and essentially concatenated all the individual nodes I had
made for characters in the book, like Fehdman Kassad, and Hyperion (planet),
and condensed them all into the rather convenient place of [[Hyperion
Cantos]]. And lo, all the information required for understanding the series,
all four books (five?), was there, in one page. It was neatly wikified in such
a way that I could quite easily find the information I needed, and an
appropriate <nowiki>{{spoiler}}</nowiki> warning was added. I feel that we can
do that with some of the other sections of the wikipedia which have sort of
"blossomed" into their own sub-pedia of television or book fiction. The two
that really worry me (and I've been too afraid to look) are Charmed- and
Xena-cruft.
There's one more thing to address here, though. Whatever your take on fancruft
is, there's an angle that nobody seems to be considering. As a database
administrator (oh, did I forget to mention I've been that, too?), I can tell
you that as tables get bigger (each page having its own tuple or whatever),
indices get clobbered, and we slow down the ENTIRE WIKIPEDIA because somebody
needed to make a node for Treeship. And Ummon. And so on. Again, I use my own
mistakes here to show that at one point, I didn't have any real understanding
of a way to add fancruft which is of cultural value (and in this case, I won't
admit that Xena or Charmed have cultural value, but somebody thinks they do,
and I don't own the wikipedia). I think what I'm getting at here is that
fancruft is not just harmful in the way that it degrades the credibility of
the wikipedia (say a user goes to a random node and gets what I got --
[[Needlenose (Power Rangers)]] -- a MMPR monster created from "Tommy's cactus
that he brought home from Arizona"), but it also degrades the overall system
performance and availability of the wikipedia as a whole. So maybe we need to
take a more serious approach to this fancruft issue. Maybe it isn't all me
bitching "I don't like Xena or Charmed [or television much at all]". Or,
somebody could come along and tell me that our database schema is robust and
segmented enough that we can support having sixty power rangers articles on
minor characters that got blown up after one episode, that we can have sixty
xena villain nodes, and so on and so forth. I for one don't buy it, but I
might be convincible.
6. How to cope with a message of this length and complexity.
Probably best to post this to the wikipedia. It's probably got RfC content in
it, VfD content in it, and so on. It will initially be a sub-node off my home
node (I'm a perlmonks veteran, hence the 'node' vernacular). Please comment
there, split out into separate pages as necessary, and so on. NNTP and email
aren't sufficient for discussing all these things in parallel.
I will attempt to lightly wikify. Consider this my permission to wikify
this node to your hearts' content.
http://en.wikipedia.org/wiki/User:Avriette/The_Big_Email
Except, we're having another meltdown.
Error in numRows(): Duplicate entry '2-Avriette/The_Big_Email' for key 2
Backtrace:
* Database.php line 502 calls wfdebugdiebacktrace()
* Database.php line 600 calls databasemysql::numrows()
* LinkCache.php line 151 calls databasemysql::selectfield()
* Title.php line 980 calls linkcache::addlinkobj()
* Skin.php line 1659 calls title::getarticleid()
* SkinTemplate.php line 174 calls skinmonobook::makeurldetails()
* OutputPage.php line 417 calls skinmonobook::outputpage()
* OutputPage.php line 614 calls outputpage::output()
* Database.php line 360 calls outputpage::databaseerror()
* Database.php line 309 calls databasemysql::reportqueryerror()
* Database.php line 885 calls databasemysql::query()
* Article.php line 871 calls databasemysql::insert()
* EditPage.php line 239 calls article::insertnewarticle()
* EditPage.php line 68 calls editpage::editform()
* EditPage.php line 164 calls editpage::edit()
* index.php line 173 calls editpage::submit()
# aja // vim:tw=78:ts=2:et
--
Alex Avriette
avriette(a)gmail.com