Hello!
Next week we'll be rebuilding and upgrading the hardware that provides DNS service to Cloud VPS and Toolforge. The rebuilds will start at 14:00 UTC and the whole process may take 2-3 hours. DNS lookups will likely be somewhat slower while clients fail over from the server being rebuilt to the one still in service. In theory there should be few other user-facing effects from these upgrades.
In practice, though, this isn't something that we've done for quite a while, and touching DNS is always risky since it underlies pretty much everything. Here are some things to be ready for:
- As a precaution we'll be disabling Horizon during the window to prevent new VMs or DNS changes from landing in an inconsistent state.
- Some badly-behaved DNS clients won't fail over properly and will report errors when their primary DNS server is down.
- Puppet will almost certainly experience transient failures, since Puppet is known to be one of those badly-behaved clients.
- If things go very badly there may be periods of total DNS outage which will result in many WMCS-hosted services failing. There's no particular reason that this /should/ happen, but this is the worst-case scenario.
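For tool maintainers who want their own code to ride out the slow or failed lookups described above, here is a minimal sketch of retrying a DNS lookup with exponential backoff. The function name `resolve_with_retry`, the `resolver` and `sleep` parameters, and the default retry count and delay are all illustrative assumptions, not anything WMCS ships or recommends; it only uses Python's standard `socket.getaddrinfo`.

```python
import socket
import time


def resolve_with_retry(host, attempts=4, base_delay=0.5,
                       resolver=socket.getaddrinfo, sleep=time.sleep):
    """Resolve host, retrying failed DNS lookups with exponential backoff.

    attempts and base_delay are illustrative defaults (assumptions).
    resolver and sleep are injectable so the logic can be tested offline.
    """
    for attempt in range(attempts):
        try:
            infos = resolver(host, None)
            # Each entry is (family, type, proto, canonname, sockaddr);
            # sockaddr[0] is the address string.
            return sorted({info[4][0] for info in infos})
        except socket.gaierror:
            if attempt == attempts - 1:
                raise  # out of retries: surface the original error
            # Wait 0.5s, 1s, 2s, ... before trying again.
            sleep(base_delay * 2 ** attempt)
```

Calling `resolve_with_retry("some-host")` returns a sorted list of address strings, or re-raises `socket.gaierror` once the retries are used up. This only helps with brief failures; it will not mask a long outage.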
For additional context, the Phabricator task for this work is https://phabricator.wikimedia.org/T253780
- Andrew + the WMCS team
Reminder: This maintenance is starting in about an hour.
On 6/2/20 8:01 AM, Andrew Bogott wrote:
[...]
This is done. Other than Horizon being disabled there was no service interruption during the upgrade.
-Andrew + the WMCS team
On 6/9/20 7:52 AM, Andrew Bogott wrote:
[...]
cloud-announce@lists.wikimedia.org