Dear users of stat100{4,6,7},
we are planning on upgrading stat1004 to Debian Buster this Thursday
(2020-09-17) after 12:00 CEST (10:00 UTC). We will reinstall the machine,
preserving user data (home directories, /srv), but to be on the safe side,
we will back up that data. After the reinstall and a few tests, we will send
an all-clear to this list.
A few things of note:
- It would be greatly appreciated if you cleaned out unneeded data before the
backup time mentioned above, thus speeding up the backup (and the restore,
if we need it).
- Any changes made to the file system contents after the time mentioned above
may be lost.
- Around the time of the backup, both cron and systemd timers will be
disabled, and still-running processes may be ungracefully terminated.
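
For anyone unsure where their disk usage is concentrated before cleaning up, here is a minimal sketch, not an official tool. The function names `dir_size` and `largest_entries` are made up for illustration, and it assumes Python 3 is available on the host. It only reports sizes and deletes nothing.

```python
import os
from pathlib import Path


def dir_size(path):
    """Return the total size in bytes of all non-directory entries under path.

    Symlinks are not followed; unreadable entries are skipped.
    """
    total = 0
    for root, _dirs, files in os.walk(path, onerror=lambda err: None):
        for name in files:
            try:
                total += os.lstat(os.path.join(root, name)).st_size
            except OSError:
                pass
    return total


def largest_entries(base, top=10):
    """Return (size_in_bytes, path) pairs for the largest direct children of base."""
    sizes = []
    for entry in os.scandir(base):
        if entry.is_dir(follow_symlinks=False):
            size = dir_size(entry.path)
        else:
            try:
                size = entry.stat(follow_symlinks=False).st_size
            except OSError:
                continue
        sizes.append((size, entry.path))
    return sorted(sizes, reverse=True)[:top]


if __name__ == "__main__":
    # Report the ten largest items directly under the home directory.
    for size, path in largest_entries(Path.home()):
        print(f"{size / 1e9:8.2f} GB  {path}")
```

Run as is, it lists the ten largest items directly under your home directory. Passing another path to `largest_entries` (for example a directory under /srv) works the same way.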
If this process works well, the remaining stat100x machines in need of an
update (6, 7) will be processed in a similar manner.
As always, if there are questions, do not hesitate to contact us.
Best,
Tobias
--
Tobias Klausmann, SRE, Wikimedia Foundation
Hi everybody,
We need to reboot stat1004 to apply some kernel settings. The maintenance
is scheduled for Friday 25th during early EU morning. Please let us know if
this impacts your work.
Luca (on behalf of the Analytics team)
Hi everybody,
We created
https://wikitech.wikimedia.org/wiki/Analytics/Systems/Maintenance_Schedule
as an attempt to help all users prepare for upcoming scheduled maintenance
windows. Every maintenance window will be announced on this mailing list
and added to the wiki page. Hope it helps!
Luca (on behalf of the Analytics team)
Hi everybody,
I am going to do some maintenance on the Hadoop cluster tomorrow, Wed 23rd,
that will require some quiet time for TLS certificate upgrades (we use TLS
to encrypt data for various daemons, like Yarn NodeManagers and HDFS
Journalnodes). By quiet time I mean no Yarn jobs running on the cluster
and HDFS in read-only mode, hopefully lasting 30 to 60 minutes at most.
More info in https://phabricator.wikimedia.org/T253957
If this impacts some important work, please let me know in the task and
I'll reschedule :)
Luca (on behalf of the Analytics team)
Hi everybody,
In the course of maintenance, I'll reboot stat1008 within the next 1-2
hours. The reboot is necessary to properly update the alternative kernel
module for GPU access[1], so GPU functionality might not be available
immediately after reboot. I will send a follow up mail once things are back
to normal.
Best,
Tobias
[1] https://phabricator.wikimedia.org/T260442
--
Tobias Klausmann, SRE, Wikimedia Foundation
Hi everybody,
stat1005 is back, now running the new DKMS drivers for the GPU.
Hopefully, this should fix the issues reported in T260442.
The machine should also work as before, just better :)
Let us know if anything is amiss. Note that stat1008 is still on
the old setup.
Best,
Tobias
On Wed, Sep 9, 2020 at 12:08 PM Tobias Klausmann <tklausmann(a)wikimedia.org>
wrote:
> Hi everybody,
>
> In the course of maintenance, I'll reboot stat1005 in ~5m. The reboot
> is needed to clear some stuck state on the GPU, as well as testing an
> alternative kernel module for GPU access[1], so GPU functionality might not
> be available immediately after reboot. I will send a follow up mail once
> things are back to what we would call normal.
>
> Best,
> Tobias
>
> [1] https://phabricator.wikimedia.org/T260442
>
> --
> Tobias Klausmann, SRE, Wikimedia Foundation
>
>
--
Tobias Klausmann, SRE, Wikimedia Foundation
Hi everybody,
In the course of maintenance, I'll reboot stat1005 in ~5m. The reboot
is needed to clear some stuck state on the GPU, as well as testing an
alternative kernel module for GPU access[1], so GPU functionality might
not be available immediately after reboot. I will send a follow up mail
once things are back to what we would call normal.
Best,
Tobias
[1] https://phabricator.wikimedia.org/T260442
--
Tobias Klausmann, SRE, Wikimedia Foundation