A few notes:
* We had a pretty, multilingual "down for maintenance" page all set up to be served for requests to the site during the downtime, but this was foiled for three or four hours because our offsite DNS in .nl hadn't been actually set up quite the way we thought it was, and our onsite DNS server in Florida was taken down earlier than planned by mistake.
We did get it mostly working for *.wikimedia.org for some people by partway through the downtime, but were not able to get the other domains (such as wikipedia.org) updated at the time. Once Zwinger was back online in the new rackspace, we had DNS again and the downtime page was visible for the remainder of the time, unless you were unlucky and it didn't work anyway.
* The downtime message was experimentally running Lighttpd+FastCGI instead of Apache. For no apparent reason it stopped understanding its 404 error handler page directive some time in the middle of things, so I switched it to Apache.
* The Paris squids were I think still sending requests to the offline Florida machines instead of the downtime page in .nl. Not totally sure what was the issue here.
* When bringing lots of web server machines online we have an issue with synchronization of time and configuration: the machines are set to automatically start the web server on boot, and the load balancers are set to automatically put work on them when they come up. But some machines have clock trouble and come up with the wrong time, and if the configuration has changed they'll have settings out of sync until changed. We need to resolve this, either by requiring a manual start or by some sort of sanity-check against the master clocks and config.
For massively wrong clocks (eg, BIOS reset to 2003) we can easily sanity check by comparing the current time against $wgCacheEpoch to make sure it's later. :)
* Things are very not happy booting without DNS or the LDAP server up. We should make bloody sure this is not as big of a problem; LDAP needs to be well-replicated, and important internal addresses should be resolvable without DNS.
-- brion vibber (brion @ pobox.com)
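The clock check suggested in the notes above could be sketched roughly like this. It is only an illustration in Python, not the code that runs on the cluster: the function names are made up, and it assumes the $wgCacheEpoch value is available as a 14-digit YYYYMMDDHHMMSS string in UTC.

```python
from datetime import datetime, timezone


def parse_cache_epoch(stamp):
    """Parse a 14-digit YYYYMMDDHHMMSS stamp (assumed format) as UTC."""
    return datetime.strptime(stamp, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)


def clock_is_sane(now, cache_epoch):
    """A clock earlier than the cache epoch is certainly wrong
    (for example, a BIOS reset to 2003)."""
    return now >= cache_epoch


if __name__ == "__main__":
    # Hypothetical epoch value, used only for this illustration.
    epoch = parse_cache_epoch("20050601000000")
    if not clock_is_sane(datetime.now(timezone.utc), epoch):
        raise SystemExit("clock is earlier than cache epoch; not starting web server")
```

A check like this only catches massively wrong clocks; small drift would still need a comparison against the master clocks.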
Brion Vibber wrote in gmane.science.linguistics.wikipedia.technical:
A few notes:
- We had a pretty, multilingual "down for maintenance" page all set up to be served for requests to the site during the downtime, but this was foiled for three or four hours because our offsite DNS in .nl hadn't been actually set up quite the way we thought it was, and our onsite DNS server in Florida was taken down earlier than planned by mistake.
regarding the downtime page, i've put it in CVS as /tools/downtime/, because we've had such a notice at least twice now, and it seems wise to have a standard, translated message for use in the future.
[...]
- The Paris squids were I think still sending requests to the offline Florida machines instead of the downtime page in .nl. Not totally sure what was the issue here.
this shouldn't have mattered, should it? the DNS for * was pointing at nl. ... well, in theory.
- When bringing lots of web server machines online we have an issue with synchronization of time and configuration: the machines are set to automatically start the web server on boot, and the load balancers are set to automatically put work on them when they come up. But some machines have clock trouble and come up with the wrong time, and if the configuration has changed they'll have settings out of sync until changed. We need to resolve this, either by requiring a manual start or by some sort of sanity-check against the master clocks and config.
For massively wrong clocks (eg, BIOS reset to 2003) we can easily sanity check by comparing the current time against $wgCacheEpoch to make sure it's later. :)
i think it should be possible to have apaches do a scap when booting before starting apache, which would remove the problem with outdated PHP files.
-- brion vibber (brion @ pobox.com)
kate.
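The sync-before-start idea above might look something like the following sketch. It is an assumption-laden illustration in Python, not an existing boot script: the function name is hypothetical, and the commands in the usage block are placeholders, since the real scap invocation and init setup are not given in this thread.

```python
import subprocess


def start_after_sync(sync_cmd, start_cmd, run=subprocess.call):
    """Run sync_cmd first; start the web server only if the sync exits 0.

    `run` is injectable so the ordering can be exercised without
    touching a real machine.
    """
    if run(sync_cmd) != 0:
        # Files may be outdated: leave the web server down so the
        # load balancer does not send work to this machine.
        return False
    return run(start_cmd) == 0


if __name__ == "__main__":
    # Placeholder commands; substitute whatever the boot scripts really use.
    ok = start_after_sync(["scap"], ["apachectl", "start"])
    raise SystemExit(0 if ok else 1)
```

The point of the ordering is that a machine which fails to sync never starts serving, which is safer than serving with stale PHP files.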
Kate Turner wrote in gmane.science.linguistics.wikipedia.technical:
regarding the downtime page, i've put it in CVS as /tools/downtime/, because we've had such a notice at least twice now, and it seems wise to have a standard, translated message for use in the future.
you can preview the current version at http://www.knams.wikimedia.org/downtime/.
kate.
Kate Turner wrote:
you can preview the current version at http://www.knams.wikimedia.org/downtime/.
Dunno who wrote the French version but it sounds like some bad machine translation.
David Monniaux <David.Monniaux <at> ens.fr> writes:
Kate Turner wrote:
you can preview the current version at http://www.knams.wikimedia.org/downtime/.
Dunno who wrote the French version but it sounds like some bad machine translation.
+1
Where's the wiki or CVS where I can correct it?
Kate Turner (keturner@livejournal.com) [050608 22:28]:
Kate Turner wrote in gmane.science.linguistics.wikipedia.technical:
regarding the downtime page, i've put it in CVS as /tools/downtime/, because we've had such a notice at least twice now, and it seems wise to have a standard, translated message for use in the future.
you can preview the current version at http://www.knams.wikimedia.org/downtime/.
Surely the versions in languages other than English should link to mirrors in the language in question, not mirrors of en:?
- d.
David Gerard wrote in gmane.science.linguistics.wikipedia.technical:
you can preview the current version at http://www.knams.wikimedia.org/downtime/.
Surely the versions in languages other than English should link to mirrors in the language in question, not mirrors of en:?
they do where such a mirror was provided by the translators (Danish, Polish).
- d.
kate.
wikitech-l@lists.wikimedia.org