The NFS servers used for the scratch and maps mounts (/data/project and /home in the maps project, and /data/scratch in other projects) will be going offline for a short time tomorrow, 2021-07-01, at around 1600 UTC to move the mounts to DRBD-synced volumes. The current setup causes odd issues during failover, including data loss and stale files left behind. This move is itself one of those failovers, so you may see similar anomalies afterward, such as files that were previously deleted showing up again and needing to be deleted once more.
I plan to reboot the maps project servers to make sure their mounts and processes are restored as cleanly as possible. The scratch mounts should see less impact: if you use scratch, just be aware that it will go offline for a bit and will come back with some possible quirks. After that, the data should be far more stable and properly synced between the two systems. The process could start later than 1600 UTC if there are initial sync issues, as I try to get as much of the data as possible transferred first.
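If your tool needs to wait out the outage rather than fail, a small poll loop can help. This is a minimal sketch only (the 30-second interval and one-hour timeout are illustrative, not part of the maintenance plan):

    import errno
    import os
    import time

    SCRATCH = "/data/scratch"  # or /data/project and /home in the maps project

    def wait_for_mount(path, interval=30, timeout=3600):
        """Poll an NFS path until the server answers again after failover."""
        deadline = time.monotonic() + timeout
        while time.monotonic() < deadline:
            try:
                os.statvfs(path)  # cheap call that has to reach the NFS server
                return True
            except OSError as err:
                # ESTALE/EIO are the typical failure modes during a failover
                if err.errno not in (errno.ESTALE, errno.EIO):
                    raise
                time.sleep(interval)
        return False

    if wait_for_mount(SCRATCH):
        print("scratch is responding again")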
More details here: https://phabricator.wikimedia.org/T224747
Brooke Storm
Staff SRE
Wikimedia Cloud Services
bstorm@wikimedia.org
Next Tuesday we will be upgrading Kubernetes on Toolforge. As part of
the upgrade we will need to restart all pods. This will produce a brief
interruption in web services and other tools that use Kubernetes.
Assuming your services are able to survive a restart, no action should
be needed on your part. I'll send a further email when the upgrade is
finished.
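For tools that manage their own state, "surviving a restart" mostly means exiting cleanly when Kubernetes sends SIGTERM before replacing the pod. A minimal sketch of that pattern (the work loop is a stand-in for whatever your tool actually does):

    import signal
    import sys
    import time

    shutting_down = False

    def handle_sigterm(signum, frame):
        # Kubernetes sends SIGTERM, then waits a grace period before SIGKILL.
        global shutting_down
        shutting_down = True

    signal.signal(signal.SIGTERM, handle_sigterm)

    while not shutting_down:
        # do one small unit of work, short enough to finish in the grace period
        time.sleep(1)

    # flush buffers, close connections, then exit cleanly
    sys.exit(0)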
Special thanks to volunteer Taavi (aka Majavah) who has been essential
in preparing for this upgrade and will be taking time out of his day to
make sure the upgrade goes smoothly on Tuesday.
-Andrew + the WMCS team
Software that uses /data/scratch may see some disruption tomorrow when I merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/695447/ around 20:00 UTC.
To make the NFS service behind it more resilient and healthy, the cluster it runs on is migrating to DRBD failover, as the Tools home and project volumes already use. The current setup is quite broken.
If you are running something against scratch when the patch becomes active in your project, you may see issues such as a stale NFS handle. If that doesn't resolve quickly, please let us know in #wikimedia-cloud on Libera.Chat.
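A stale handle generally clears once the file is re-opened against the new primary server. If your code holds files open across the window, a retry wrapper along these lines may help (purely a sketch; the retry count and delay are illustrative):

    import errno
    import time

    def read_with_retry(path, attempts=5, delay=10):
        """Re-open the file on ESTALE; a fresh open gets a fresh NFS handle."""
        for attempt in range(attempts):
            try:
                with open(path, "rb") as f:
                    return f.read()
            except OSError as err:
                if err.errno != errno.ESTALE or attempt == attempts - 1:
                    raise
                time.sleep(delay)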
Brooke Storm
Staff SRE
Wikimedia Cloud Services
bstorm@wikimedia.org
TL;DR:
* The #wikimedia-cloud IRC channel is moving from Freenode to Libera.Chat.
* Register an account on Libera.Chat and join us there!
There has been a lot of activity over the last 2-3 days related to
staffing changes on the Freenode IRC network [0]. The Wikimedia IRC
Group Contacts (GCs) [1] evaluated the situation and decided that
moving the Wikimedia IRC channels from Freenode to the brand new
Libera.Chat IRC network [2] would be the best course of action [3].
So, we are moving!
A new #wikimedia-cloud channel has been created on irc.libera.chat for
this Wikimedia sub-community to use. The old channel on Freenode still
exists and will be maintained at least until we can get all the bots
moved, our documentation updated on wikitech, and we see more folks on
the Libera channel than on the Freenode one. Messages sent to our
channel on either IRC network, as well as to the Telegram channel [4],
will be relayed to all the others.
There is a new subpage on meta [5] for information on how to create a
new account for yourself on Libera.Chat and other related information.
There is also a tracking task [6] listing the various activities that
the community hopes to complete as part of the migration.
One last thing: The #wmhack Freenode channel is bridged to
#wikimedia-hackathon on Libera.Chat. The new channel name will make it
easier for the GCs to help manage spam and other issues that come up
occasionally on IRC. Don't miss the fun of our 2021 virtual hackathon
from Friday, May 21st to Sunday, May 23rd! [7]
[0]: https://www.kline.sh/
[1]: https://meta.wikimedia.org/wiki/IRC/Group_Contacts
[2]: https://libera.chat/
[3]: https://meta.wikimedia.org/w/index.php?diff=21476411
[4]: https://t.me/wmcloudirc
[5]: https://meta.wikimedia.org/wiki/IRC/Migrating_to_Libera_Chat
[6]: https://phabricator.wikimedia.org/T283247
[7]: https://www.mediawiki.org/wiki/Wikimedia_Hackathon_2021
Bryan, on behalf of the WMCS team and the Cloud VPS and Toolforge admins
--
Bryan Davis Technical Engagement Wikimedia Foundation
Principal Software Engineer Boise, ID USA
[[m:User:BDavis_(WMF)]] irc: bd808
Hello there,
We will be doing an upgrade to the Cloud VPS edge network on Thursday 2021-05-06 @
15:00 UTC that will likely impact user experience, including in Toolforge.
We have scheduled a 1-hour operations window. During that time, intermittent
network interruptions, packet loss, and other network problems are to be expected.
The edge network maintenance will affect how virtual machines (and Toolforge
tools) contact NFS, the wiki replicas, the wikis' API endpoints, and, in
general, any network traffic leaving or entering the cloud (also known as
north-south traffic).
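If your tool makes outbound calls during that window (to the wiki APIs, for example), a short retry with backoff should be enough to ride out the intermittent drops. A rough sketch, with illustrative values:

    import time
    import urllib.request

    def fetch_with_backoff(url, attempts=5):
        """Retry transient network failures with exponential backoff."""
        for attempt in range(attempts):
            try:
                with urllib.request.urlopen(url, timeout=10) as resp:
                    return resp.read()
            except OSError:  # URLError and socket timeouts are subclasses
                if attempt == attempts - 1:
                    raise
                time.sleep(2 ** attempt)  # 1s, 2s, 4s, ...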
More information on the operation can be found in phabricator [0] and in
wikitech [1].
Regards.
[0]: https://phabricator.wikimedia.org/T270704
[1]: https://wikitech.wikimedia.org/wiki/Wikimedia_Cloud_Services_team/Enhanceme…
--
Arturo Borrero Gonzalez
SRE / Wikimedia Cloud Services
Wikimedia Foundation
We will be upgrading the Cloud-VPS OpenStack install later today
beginning at 14:00 UTC (7:00 AM Pacific time).
The total upgrade should take 60-90 minutes. During the upgrade period
Horizon will be disabled. There may also be brief network interruptions
as we restart router services.
-Andrew + the WMCS Team
TL;DR:
* We messed up when replacing the mail server in Toolforge
* We didn't notice that we had messed up for nearly 3 weeks
* Toolforge servers should be able to send outbound email again now
We have been working to replace some of the Cloud VPS instances in the
Toolforge project with new instances running Debian Buster
(<https://phabricator.wikimedia.org/T275864>). One step in this
process was to replace the mail server instance that handles all
outbound mail.
We set up a new mail server on 2021-03-31, but missed an important
configuration step: telling the rest of the instances in the
Toolforge project to use the new server when sending outgoing mail. A
Toolforge user reported on IRC at 2021-04-20T21:11Z that they had not
received expected emails from their tool recently. Investigation
revealed the broken configuration and work started to correct the
problem. Around 2021-04-20T21:52Z we deployed the correct mail relay
host configuration. Over the next 30 minutes or so this configuration
update rolled out across the Toolforge instances, re-enabling outbound
mail sending. Around 2021-04-20T22:20Z we ran commands to instruct all
Toolforge instances to "unfreeze" emails which were queued for sending
but marked as "frozen" due to the prior invalid configuration.
Emails are now being sent out as expected. We apologize for the
interruption in service. We will also be looking into an active
monitoring system for outbound email delivery, to catch similar
problems more quickly in the future.
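As a starting point for that monitoring, even a simple end-to-end probe that periodically sends a test message and alerts if it never arrives would catch this class of failure. The sending half might look like this (the relay host and addresses below are hypothetical, not the actual Toolforge configuration):

    import smtplib
    from email.message import EmailMessage

    RELAY = "mail.example.wmcloud.org"  # hypothetical relay hostname

    def send_probe():
        msg = EmailMessage()
        msg["From"] = "probe@example.org"
        msg["To"] = "canary@example.org"
        msg["Subject"] = "outbound mail probe"
        msg.set_content("If this arrives, outbound mail is flowing.")
        # raises smtplib.SMTPException or OSError on failure; wire into alerting
        with smtplib.SMTP(RELAY, 25, timeout=30) as smtp:
            smtp.send_message(msg)

    if __name__ == "__main__":
        send_probe()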
Bryan, on behalf of the Toolforge admin team
--
Bryan Davis Technical Engagement Wikimedia Foundation
Principal Software Engineer Boise, ID USA
[[m:User:BDavis_(WMF)]] irc: bd808
We will be upgrading the Cloud-VPS OpenStack install tomorrow beginning
at 14:30 UTC (7:30 AM Pacific time).
The total upgrade should take 60-90 minutes. During the upgrade period
Horizon will be disabled. There may also be brief network interruptions
as we restart router services.
-Andrew + the WMCS Team