Another note: we really need to be able to survive a Zwinger downtime much better than we do currently. Sometimes it's an avoidable accident, but it might be hardware failure... and we'd like to get Zwinger upgraded some day so we don't have to special-case installations on it for Red Hat 9. Surviving that upgrade would be nice. ;)
Most of the files the web servers need to run (eg, the PHP scripts themselves) are stored on each machine's local disk, and we push out updates. That's good!
Uploaded files are on another server; downtime there can also be bad, but at least Zwinger going down shouldn't kill those too.
However, we are reading a few bits off of Zwinger's NFS (some block lists etc., some lock files) and sometimes writing (logs). Insofar as those are currently used, they should either be migrated to a more survivable situation or be able to fail gracefully. NFS should be set up, if it isn't already, in a way that will fail cleanly after a short timeout.
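As a rough illustration of "fail gracefully", here is a minimal Python sketch of reading a block list from the NFS mount while keeping a local copy to fall back on. The function name and both paths are hypothetical, not taken from the actual setup; it assumes the mount reports a timeout as an ordinary I/O error instead of hanging.

```python
import os


def read_with_fallback(primary_path, cache_path):
    """Return (lines, source) where source is "primary", "cache" or "empty".

    primary_path is assumed to live on the NFS mount and cache_path on
    local disk; both names are placeholders for illustration.
    """
    try:
        with open(primary_path, "r", encoding="utf-8") as f:
            data = f.read()
    except OSError:
        # A timed-out soft mount shows up here as an I/O error.
        try:
            with open(cache_path, "r", encoding="utf-8") as f:
                return f.read().splitlines(), "cache"
        except OSError:
            # No local copy either: carry on with an empty list.
            return [], "empty"

    # Refresh the local copy; write to a temp file and rename so a crash
    # never leaves a half-written cache behind.
    tmp_path = cache_path + ".tmp"
    try:
        with open(tmp_path, "w", encoding="utf-8") as f:
            f.write(data)
        os.replace(tmp_path, cache_path)
    except OSError:
        pass  # failing to refresh the cache must not break the read
    return data.splitlines(), "primary"
```

Whether an empty list is an acceptable last resort depends on what the file is used for; for a block list it may be safer to refuse the action instead.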
Some other configuration files, such as php.ini, and various programs and utilities (one of the perlbals?) are also pulled off of NFS currently. These need to generally be fixed up so that things can continue running while home dirs are down; pushing the files out on update as we do with the PHP scripts is probably in order.
-- brion vibber (brion @ pobox.com)
> However we are reading a few bits off of zwinger's NFS (some block lists etc, some lock files) and sometimes writing (logs). Insofar as those are currently used they should be either migrated to a more survivable situation or should be able to fail gracefully. NFS should be set up if it's not in a way that will fail cleanly after a short timeout.
The Linux mount option "soft" will cause an I/O error to be returned after a "major timeout," the definition of which varies. "intr" in combination with "hard" will allow the program to respond to signals, which is in most cases preferable to having an uninterruptible process sitting there until reboot.
If you want anything better than that, use iSCSI.
-- Austin
Austin Hair wrote:
> > However we are reading a few bits off of zwinger's NFS (some block lists etc, some lock files) and sometimes writing (logs). Insofar as those are currently used they should be either migrated to a more survivable situation or should be able to fail gracefully. NFS should be set up if it's not in a way that will fail cleanly after a short timeout.
> Linux mount option "soft" will cause an I/O error to be returned after a "major timeout," the definition of which varies. "intr" in combination with "hard" will allow the program to respond to signals, which is in most cases preferable to having an uninterruptable process sitting there until reboot.
We mount NFS with soft and timeo=14. I imagine retrans is at its default value of 3, so if I understand the manual correctly (timeo is in tenths of a second, and the timeout doubles after each retransmission), that gives a major timeout of 1.4 + 2.8 + 5.6 = 9.8 seconds. That would be consistent with what we saw in the crash -- most apps don't seem to abort when they get one of these timeouts; they just treat it as an ordinary read error and continue their execution. It's not surprising that everything locked up, including root logins.
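The arithmetic can be checked with a few lines of Python. This is only a sketch of one reading of the manual: it assumes timeo is given in tenths of a second, that the wait doubles after every minor timeout, and that a single wait is capped at 60 seconds. The function name is made up for illustration.

```python
def major_timeout_seconds(timeo_tenths, retrans, cap_seconds=60.0):
    """Sum the waits before a major timeout, assuming the wait doubles
    after each minor timeout and never exceeds cap_seconds."""
    total = 0.0
    wait = timeo_tenths / 10.0  # timeo is assumed to be in tenths of a second
    for _ in range(retrans):
        total += min(wait, cap_seconds)
        wait *= 2
    return total


if __name__ == "__main__":
    # timeo=14, retrans=3: 1.4 + 2.8 + 5.6
    print(round(major_timeout_seconds(14, 3), 1))
```

If the mount behaves differently (a different backoff, or TCP instead of UDP), the figure will differ, so treat the result as an estimate.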
What about using a detachable filesystem like Coda, or a spare NFS server with automatic failover?
-- Tim Starling
Brion Vibber wrote:
> However we are reading a few bits off of zwinger's NFS (some block lists etc, some lock files) and sometimes writing (logs). Insofar as those are currently used they should be either migrated to a more survivable situation or should be able to fail gracefully. NFS should be set up if it's not in a way that will fail cleanly after a short timeout.
Hello,
All logs should be sent to a dedicated logging server; I have been ranting about it for the last 2 years. A nice big syslog server could host all the squid logs, mail logs, apache error logs, various daemon logs, etc. That would let devs find what's wrong in one place (you want redundant servers).
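For application-level logs, sending to a remote syslog instead of writing to an NFS-mounted file could look roughly like this in Python. The host name and logger name in the usage note are placeholders, and the format string is just one possible choice.

```python
import logging
import logging.handlers


def make_remote_logger(name, host, port=514):
    """Return a logger that sends records to a syslog server over UDP.

    host and port are whatever the (hypothetical) central log server
    listens on; 514 is the conventional syslog UDP port.
    """
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    # SysLogHandler uses UDP by default, so a dead log server does not
    # block the sender the way a hung NFS write would.
    handler = logging.handlers.SysLogHandler(address=(host, port))
    handler.setFormatter(logging.Formatter("%(name)s: %(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger
```

Usage would be something like make_remote_logger("apache-errors", "loghost.example").error("..."). The trade-off is that UDP messages are silently dropped while the log server is down.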
For file updates you can push files using scp / ftp instead of a local cp over an NFS mount.
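A push could be scripted along these lines. This is a hedged sketch, not the existing sync scripts: the function names are invented, the host names in the usage note are placeholders, and it assumes key-based ssh access from the machine doing the push.

```python
import subprocess


def build_push_commands(local_path, hosts, remote_path, connect_timeout=5):
    """Build one scp command per target host (nothing is executed here)."""
    return [
        ["scp", "-p", "-o", f"ConnectTimeout={connect_timeout}",
         local_path, f"{host}:{remote_path}"]
        for host in hosts
    ]


def push(local_path, hosts, remote_path):
    """Run the scp commands and return the hosts where the copy failed."""
    failed = []
    commands = build_push_commands(local_path, hosts, remote_path)
    for host, cmd in zip(hosts, commands):
        # A non-zero exit status means this host did not get the file;
        # keep going so one dead server does not stop the whole push.
        if subprocess.run(cmd).returncode != 0:
            failed.append(host)
    return failed
```

For example, push("php.ini", ["srv1", "srv2"], "/etc/php.ini") would return the list of servers that need a retry. The connect timeout keeps an unreachable server from stalling the push.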
For lock files, I don't know why any of them use the mount anyway. Can't they be on the local file system under /var/lock/ or something?
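A lock kept on local disk could be taken like this in Python. The helper name and the path in the usage note are hypothetical. Note that a local lock only coordinates processes on the same machine, so this does not help if the current lock files are meant to serialize work across servers.

```python
import fcntl
import os
from contextlib import contextmanager


@contextmanager
def local_lock(path):
    """Try to take an exclusive lock on a local file without blocking.

    Yields True if the lock was acquired, False if another holder has it.
    """
    fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o644)
    try:
        try:
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            # Someone else holds the lock; report that instead of waiting.
            yield False
            return
        try:
            yield True
        finally:
            fcntl.flock(fd, fcntl.LOCK_UN)
    finally:
        os.close(fd)
```

Usage would be "with local_lock('/var/lock/some-job.lock') as got_it:" followed by a check of got_it before doing the work.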
Mail services (email from wikipedia.org and mailing lists) should probably be moved off of the apaches and zwinger to a new dedicated email server.
Ganglia could be moved to Larousse, which already hosts the noc.wikimedia.org website and is (was) used for Nagios. Larousse could become the monitoring box.
> Some other configuration files, such as php.ini, and various programs and utilities (one of the perlbals?) are also pulled off of NFS currently. These need to generally be fixed up so that things can continue running while home dirs are down; pushing the files out on update as we do with the PHP scripts is probably in order.
Stop pulling! Push with scp / sftp :o)
Ideally, someone should list every service provided by zwinger and plan the migration of each service to a dedicated server. Once all services are migrated, zwinger can be reinstalled from scratch using whatever standard distribution is in use at that time (FC3?) and then reused for something else (maybe as a dev machine for testing stuff).
cheers,
wikitech-l@lists.wikimedia.org