Hello everyone,
We are currently experiencing widespread intermittent DNS resolution issues within the Toolforge Kubernetes cluster that began a few hours ago. The WMCS team is actively investigating and working on resolving the issue.
Current impact:
- Intermittent DNS resolution failures across the cluster
- Multiple tool deployments experiencing crashes and restarts
- Inconsistent job execution
- Image pull failures
- API connectivity issues
You can follow the ongoing investigation and updates at: https://phabricator.wikimedia.org/T380844
We will send another update once we have more information or when the incident is resolved.
Thank you for your patience,
WMCS Team
--
Slavina Stefanova (she/her)
Software Engineer | Cloud Services
Wikimedia Foundation
Hello everyone,
Following up on this incident: the situation has stabilized since the control plane nodes were rebooted at around 10:25 UTC.
*Current status:*
- No new DNS-related failures have been observed since the control plane reboots
- Tool deployments and jobs are running normally
While we're still observing some underlying networking warnings, these are not currently impacting service. We will continue monitoring the situation and investigating the root cause to prevent future occurrences.
If you notice any DNS-related issues, please report them in the Phabricator task: https://phabricator.wikimedia.org/T380844
Thank you for your patience during this incident.
Cheers,
WMCS Team
--
Slavina Stefanova (she/her)
Software Engineer | Cloud Services
Wikimedia Foundation
On Tue, Nov 26, 2024 at 11:27 AM Slavina Stefanova <sstefanova@wikimedia.org> wrote:
Hello everyone,
We are currently experiencing widespread intermittent DNS resolution issues within the Toolforge Kubernetes cluster that began a few hours ago. The WMCS team is actively investigating and working on resolving the issue.
Current impact:
- Intermittent DNS resolution failures across the cluster
- Multiple tool deployments experiencing crashes and restarts
- Inconsistent job execution
- Image pull failures
- API connectivity issues
You can follow the ongoing investigation and updates at: https://phabricator.wikimedia.org/T380844
We will send another update once we have more information or when the incident is resolved.
Thank you for your patience,
WMCS Team
Slavina Stefanova (she/her) Software Engineer | Cloud Services
Wikimedia Foundation
wikitech-l@lists.wikimedia.org