We need to perform maintenance on our primary NFS cluster today. This
should not impact users, but it may if something does not go as
planned. We will keep the list updated on the status as much as
possible. Apologies for the short notice. This is set to begin in 3 hours.
--
Chase Pettet
chasemp on phabricator <https://phabricator.wikimedia.org/p/chasemp/> and
IRC
Hello!
Much of the Cloud Services staff will be traveling and attending
meetings next week. There will always be someone available for
emergencies, but routine support requests may get handled more slowly
than usual.
Things will be back to normal the following Monday, the 25th.
- Andrew + the Cloud Services team
Because space on the NFS for tools is getting very tight, I am going to clean up (truncate) log, err and out files that are larger than 100M.
I will start this cleanup process on Monday, June 11, 2018.
If you have concerns about this process, please let us know on #wikimedia-cloud .
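
For anyone who wants to check their own tool directory ahead of the cleanup, here is a minimal Python sketch of the kind of scan described above. It is not the script the cleanup will actually use: the suffix list, the reading of "100M" as 100 MiB, and the function names are all assumptions made for illustration. It defaults to a dry run, so nothing is changed unless you ask for it.

```python
import os
import stat
from pathlib import Path

SIZE_LIMIT = 100 * 1024 * 1024       # "100M" read as 100 MiB (assumption)
SUFFIXES = (".log", ".err", ".out")  # assumed naming of the affected files


def find_large_logs(root, limit=SIZE_LIMIT, suffixes=SUFFIXES):
    """Yield (path, size) for regular files under root that exceed limit."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(suffixes):
                continue
            path = Path(dirpath) / name
            try:
                st = path.lstat()  # lstat: do not follow symlinks
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if stat.S_ISREG(st.st_mode) and st.st_size > limit:
                yield path, st.st_size


def truncate_large_logs(root, dry_run=True, **kwargs):
    """Truncate oversized log files to zero bytes; only report if dry_run."""
    affected = []
    for path, size in find_large_logs(root, **kwargs):
        if not dry_run:
            os.truncate(path, 0)
        affected.append((path, size))
    return affected
```

Calling `truncate_large_logs("/path/to/your/tool")` lists the files that would be affected; passing `dry_run=False` actually empties them.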
Thank you for your understanding,
Brooke Storm
Operations Engineer
Wikimedia Cloud Services
bstorm(a)wikimedia.org
IRC: bstorm_
As part of routine security maintenance, we'll be rebooting all VMs and
virtualization hosts next Wednesday starting at 14:00 UTC (7AM San
Francisco time).
Toolforge users should be largely unaffected by this activity. Other
projects (including deployment-prep) will experience sporadic downtime,
a few minutes for each interruption.
The entire process will take several hours. If you need a to-the-minute
advance schedule for any particular reboot, please let me know and I'll
put your system at the start.
-Andrew + the cloud team
Hi!
We deleted the prometheus user from LDAP and created it locally [0].
This may cause puppet failures, since for a period of time the files in
/var/lib/prometheus will still carry the old LDAP uid/gid.
We are running a massive, CloudVPS-wide deluser/adduser/chown operation
to fix this.
[0] https://phabricator.wikimedia.org/T196137
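
If you want to check whether an instance is still in the mismatched state described in the prometheus message above, the following Python sketch lists paths whose owner or group differs from an expected uid/gid. It is only an illustration, not the tooling used for the fleet-wide fix, and the function name is hypothetical.

```python
import os


def ownership_mismatches(root, expected_uid, expected_gid):
    """Return paths under root (root included) not owned by the expected uid/gid."""
    paths = [root]
    for dirpath, dirnames, filenames in os.walk(root):
        paths.extend(os.path.join(dirpath, n) for n in dirnames + filenames)

    mismatched = []
    for path in paths:
        st = os.lstat(path)  # lstat: inspect symlinks themselves, not their targets
        if (st.st_uid, st.st_gid) != (expected_uid, expected_gid):
            mismatched.append(path)
    return mismatched


# Example use on a Unix host (assumes a local "prometheus" user exists):
#   import pwd
#   entry = pwd.getpwnam("prometheus")
#   ownership_mismatches("/var/lib/prometheus", entry.pw_uid, entry.pw_gid)
```

An empty list means every path already belongs to the expected user and group.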
ToolsDB will be undergoing maintenance and updates on Tuesday, June 5th, from 15:00 to 16:00 UTC.
Actual outage times should be fairly brief, but during this window the database will be taken offline and the system rebooted. Because the outage is expected to be brief and some tables are not replicated (see https://wikitech.wikimedia.org/wiki/Help:Toolforge/Database#ToolsDB_Backups…), we are not planning to fail over to the replica at this time.
Brooke Storm
Operations Engineer
Wikimedia Cloud Services
bstorm(a)wikimedia.org
IRC: bstorm_
We upgraded the Mono/.NET framework in Toolforge/GridEngine from the 3.x
version to 5.x [0].
We discovered that some tweaking is required due to some weird behavior
regarding memory allocation by the framework [1].
The first symptom you will see is your bot causing high CPU load (spinning).
The fix is easy: just tell Mono that more memory is available when
running the tool/bot. However, you will need to cancel your job submissions
and resubmit them. Please refer to the phabricator bug [1] for more details.
Sorry for the inconvenience.
[0] https://phabricator.wikimedia.org/T194665
[1] https://phabricator.wikimedia.org/T195834
Hello!
The Cloud Services team is traveling quite a bit in the next few weeks:
the Hackathon, the OpenStack Summit, and some personal travel. There
will always be at least one person available for emergencies, but please
be patient if we're slow to respond to requests.
Everyone should be back by the first of the month.
- Andrew + the Cloud Services team
As part of some long-deferred routine maintenance, we need to update
(and, in one case, physically move) the network servers that handle all
traffic between WMCS instances. During each change, all WMCS network
traffic (including network access to all tools and VMs) will be
interrupted for several minutes.
The first outage will be:
Tuesday, May 15 at 13:00 UTC
The second outage will be three hours later:
Tuesday, May 15 at 16:00 UTC
In each case outages should last no more than ten to fifteen minutes.
More details about this move can be found at
https://phabricator.wikimedia.org/T193579 .
-Andrew
Next Friday we'll be upgrading our OpenStack cluster. The upgrade
should not interrupt any existing tools or instances, but during the
upgrade it will be impossible to create, delete, or modify WMCS VMs.
I'll start the process at around 14:00 UTC (7AM PDT). The complete
upgrade may take much of the day; Horizon will be disabled for the
duration of this update. CI/Nodepool will function intermittently
during this time.
-Andrew