Another power failure at the colo today. I'm not too sure of details yet as this happened in the middle of the night for me.
Admin log: https://wikitech.leuksman.com/view/Server_admin_log#April_19
PowerMedium is very very very very very very sorry and blames the equipment manufacturer; the defective equipment is apparently being replaced.
Note that the server that carries the data dump files is currently offline. If we don't have it back real soon now I'll restart them on another server.
-- brion vibber (brion @ pobox.com)
Brion Vibber wrote:
> Another power failure at the colo today. I'm not too sure of details yet as this happened in the middle of the night for me.
> Admin log: https://wikitech.leuksman.com/view/Server_admin_log#April_19
> PowerMedium is very very very very very very sorry and blames the equipment manufacturer; the defective equipment is apparently being replaced.
> Note that the server that carries the data dump files is currently offline. If we don't have it back real soon now I'll restart them on another server.
Happened in the middle of the night for me, too, but I happened to be editing at the time. (When I'm making template changes, I try to do them at the lowest load, so they don't interfere.)
From outside, it appeared to be on the hour (within seconds)! What task are they doing that might have tripped the circuit?
Also, you seem to have lost all routing and BGP announcements (at least as seen from here), although I had no problem getting DNS (presumably outside replication). There's a single point of failure there. There should have been failover to another cluster.
While the main folks are fixing things (I assume they are very busy now), could somebody else point me at documentation about the setup?
William Allen Simpson wrote:
> Also, you seem to have lost all routing and BGP announcements (at least as seen from here), although I had no problem getting DNS (presumably outside replication). There's a single point of failure there. There should have been failover to another cluster.
Routing/BGP is the colo's responsibility.
> While the main folks are fixing things (I assume they are very busy now),
Most things were fixed hours ago; nobody bothered to send out a message until I woke up. :P
> could somebody else point me at documentation about the setup?
Various docs (not all outdated) on our setup: https://wikitech.leuksman.com/view/Main_Page
-- brion vibber (brion @ pobox.com)
On 19/04/06, William Allen Simpson <william.allen.simpson@gmail.com> wrote:
> While the main folks are fixing things (I assume they are very busy now), could somebody else point me at documentation about the setup?
As someone who is still an outsider when it comes to the operations side of Wikimedia, I think half the problem a lot of the time is a lack of proper internal documentation. On a project where you have a lot of volunteer staff doing things ad hoc some of the time, it's essential that things are documented, so that if Domas or Mark aren't available, Brion and Jens can bring up the squids without needing to have been the people who did the configuration [random example there].
No doubt people will be quick to point out the existence of http://wikitech.leuksman.com, which I counter with the argument that at least some of that information has to be out of date, and some of it is marked as such. I understand there is a private mailing list where discussion of core issues (and sensitive ones, I assume) takes place, but the basics ought to be documented somewhere, and there's a lack of it.
The entire project at large suffers from a lack of documentation; some of the MediaWiki documentation, such as it is, is still stuck on Meta, and needs shifting across to MediaWiki.org. Vast quantities of pages need to be audited and brought up to date with what is current.
We're talking about a group of people volunteering their skills from all over the globe, at different levels and in different fields. We've got Aussies, and we've got Europeans and Californians. Those people can't always be available to communicate exactly what they're doing. I see no reason Wikimedia's cluster can't continue in this vein. I see every reason why it's going to become even more unmanageable than it is now without some more organisation.
This rant isn't directed at the staff so much as the system as a whole.
Rob Church
wikitech-l@lists.wikimedia.org