Hello,
I have to carry out some planned maintenance on jupyter to apply a small
configuration change. The work involves stopping all user notebook
servers on all stat100x boxes, so that they can be re-spawned with
updated settings. Your jupyter notebook server will be automatically
restarted the next time you try to access it, but the work will
interrupt any running kernels and notebooks that you may be using at the
time. You won't lose any code from the notebooks themselves, however.
I propose to carry out this work at 10:00 UTC tomorrow, if that's
acceptable.
If this is going to cause you any inconvenience, please let me know and
I will either exclude your personal jupyter notebook server(s) from the
process and work with you to find a more convenient time, or re-schedule
the maintenance window altogether.
If you have any queries or comments, please do let me know.
Thanks and kind regards,
Ben
--
*Ben Tullis* (he/him)
Senior Site Reliability Engineer
Wikimedia Foundation <https://wikimediafoundation.org/>
Hello,
I need to find a time to reboot two of our analytics explorer (aka stat)
servers in order to pick up a new kernel version.
These servers are the two that have the AMD GPUs in them, namely
stat1005 and stat1008.
Ideally I would like to reboot both of these tomorrow, November the 2nd,
between 10:00 UTC and 10:30 UTC.
Please let me know if this maintenance window is too soon and would
cause you inconvenience.
If so, I will look to push back the date of
the reboots to accommodate your needs.
Kind regards,
Ben
--
*Ben Tullis* (he/him)
Senior Site Reliability Engineer
Wikimedia Foundation <https://wikimediafoundation.org/>
Hi,
Just an FYI that I've finished up Michael Holloway's work on getting
Event Stream Config to support per wiki and/or group overrides
<https://phabricator.wikimedia.org/T277193> in wgEventStreams.
If you need to override settings for EventLogging like sample rate, it is
now possible to do so. Documentation here
<https://wikitech.wikimedia.org/wiki/Event_Platform/Instrumentation_How_To#O…>.
The main change to make this work was to start keying stream configs by
stream name. There is no longer a 'stream' setting you have to add to the
config.
Enjoy!
-Andrew Otto
SRE, Data Engineering
Wikimedia Foundation
Hi everybody,
We have upgraded the AMD ROCm stack for our GPUs to 4.2 (it is not the
latest upstream but close to it). There are two main things to know:
- If you are using tensorflow-rocm on stat100[5,8], please upgrade it to
version 2.5.0 (this is now the only supported version; previously it was
2.3.1).
- A new package was added to support the ONNX framework (see
https://phabricator.wikimedia.org/T287267)
All details have been added to
https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/AMD_GPU as
well.
Enjoy the new stack, and let me know in the aforementioned task if you
encounter any issues or have any questions.
Luca
Hello,
We need to restart the Hive server and metastore components in order to
pick up a recent configuration change.
(https://gerrit.wikimedia.org/r/c/operations/puppet/+/709484)
If possible, I'd like to schedule this operation for 09:00 UTC this
Wednesday: 2021/08/04
The service restart itself should only take a few moments, but YARN
processes that are using Hive tables at the time of the restart will
likely return an error. Please attempt to schedule any jobs such that
they will not coincide with this restart, or be prepared to re-run them
in case of failure.
If this scheduled maintenance window will cause significant
inconvenience for you, then please let me know and I will attempt to
re-schedule.
The Phabricator task relating to this work is: T279304
Kind regards,
Ben
--
*Ben Tullis* (he/him)
Senior Site Reliability Engineer
Wikimedia Foundation <https://wikimediafoundation.org/>
Hi,
Tomorrow (July 20th) at 15:00 UTC (08:00 PDT / 11:00 EDT / 17:00
CEST), there will be maintenance on the network switch that connects
the bast1003.wikimedia.org bastion. This will interrupt network links
(and thus your SSH connections over that bastion server) for up to a
minute.
If you're accessing a Wikimedia production server over
bast1003.wikimedia.org around that time, please switch your SSH client
config to use e.g. bast2002.wikimedia.org (which will be unaffected).
The Phabricator task for this network maintenance is
https://phabricator.wikimedia.org/T286069
Cheers,
Moritz
Hi all,
We will be draining the hadoop cluster of jobs Thursday July 20 (tomorrow)
starting at 15:00 UTC (8am PDT).
I apologize for waiting until now to announce the maintenance; let me know
if the short lead on this is an issue, and we can reschedule if necessary.
View planned maintenance here
<https://wikitech.wikimedia.org/wiki/Analytics/Systems/Maintenance_Schedule>.
Reply to this email or comment on the task at
https://phabricator.wikimedia.org/T278423 if you have any questions or
concerns.
Regards,
Razzi & the Data Engineering team
Hi all,
We will be draining the hadoop cluster of jobs ONE LAST TIME (for now) on
Thursday July 1, starting at 15:00 UTC (8am PDT). One of the namenodes was
upgraded last time around, and we'll upgrade the other this Thursday.
Maintenance should last around 2 hours.
View planned maintenance here
<https://wikitech.wikimedia.org/wiki/Analytics/Systems/Maintenance_Schedule>.
Reply to this email or comment on the task at
https://phabricator.wikimedia.org/T278423 if you have any questions or
concerns.
Regards,
Razzi & the Data Engineering team
Hi all,
We will be draining the hadoop cluster of jobs Tuesday May 25 starting at
15:00 UTC (8am PDT) for another attempt to update the operating system of
the namenodes. The first time around, we encountered an issue and called
off the update. This time, I'm estimating the maintenance will last 3
hours. We'll announce when the queue is back to accepting jobs.
View planned maintenance here
<https://wikitech.wikimedia.org/wiki/Analytics/Systems/Maintenance_Schedule>.
Reply to this email or comment on the task at
https://phabricator.wikimedia.org/T278423 if you have any questions or
concerns.
Regards,
Razzi & the Data Engineering team