Hello,

This is a brief reminder that HDFS will be put into safe mode in approximately one hour.
The YARN scheduler queues will also be set to a STOPPED state for the duration of the switch upgrade.

I'll send out a further message once the work has been completed.

Kind regards,
Ben

On 27/03/2023 11:18, Ben Tullis wrote:

Hello,

Tomorrow at 14:00 UTC the SRE team will be carrying out an upgrade of the switches in eqiad row B (https://phabricator.wikimedia.org/T330165). The resulting network outage to this row is expected to last around 30 minutes, all being well.

In support of this work, the Data Engineering team will be putting the HDFS file system into safe mode at approximately 13:30 UTC tomorrow, which means that write operations to the cluster will be refused.
Jobs sent to the YARN cluster will also be refused from around the same time, so please try to plan any work that you may have for the cluster to avoid this maintenance window.

Some additional internal-facing services for analytics such as Hive, Superset, Presto, and the Druid-analytics cluster will also be largely unavailable for some periods while the switch upgrade takes place.

The public-facing Analytics Query Service (AQS) will continue to function, albeit with a degraded response to some queries. However, Wikistats (stats.wikimedia.org) will be unavailable whilst the switch upgrade is in progress.

Finally, two of the stats servers, stat1007 and stat1009, will be unavailable, so please save any work that you may have on these servers before the loss of connectivity.

Please do reach out via any of the normal channels (email: analytics@lists.wikimedia.org, IRC: #wikimedia-analytics, Slack: #data-engineering) if you have any queries or concerns.

Kind regards,
Ben

Ben Tullis (he/him)
Senior Site Reliability Engineer
Wikimedia Foundation