Hello,
Tomorrow the SRE team will be carrying out an upgrade of the switches in
eqiad row B (https://phabricator.wikimedia.org/T330165) at 14:00 UTC.
The network outage to this row resulting from this work is expected to
be around 30 minutes, all being well.
In support of this work, the Data Engineering team will be putting the
HDFS file system into safe mode at approximately 13:30 UTC tomorrow,
which means that write operations to the cluster will be refused.
Jobs sent to the YARN cluster will also be refused from around the same
time, so please try to plan any work that you may have for the cluster
to avoid this maintenance window.
Some additional internal-facing services for analytics such as Hive,
Superset, Presto, and the Druid-analytics cluster will also be largely
unavailable for some periods while the switch upgrade takes place.
The public-facing Analytics Query Service (AQS) will continue to
function, albeit with a degraded response to some queries. However,
Wikistats (stats.wikimedia.org) will be unavailable whilst the switch
upgrade is in progress.
Finally, two of the stats servers, stat1007 and stat1009, will be
unavailable, so please save any work that you may have on these servers
before the loss of connectivity.
Please do reach out via any of the normal channels (email:
analytics(a)lists.wikimedia.org , IRC: #wikimedia-analytics , Slack:
#data-engineering ) if you have any queries or concerns.
Kind regards,
Ben
--
*Ben Tullis*(he/him)
Senior Site Reliability Engineer
Wikimedia Foundation <https://wikimediafoundation.org/>
Hello and apologies for the short notice.
We are required to put HDFS into safe mode at approximately 13:50
UTC today, which means that the file system will be read-only.
This might be for as little as 30 minutes, but the maintenance window
we're working within is for up to 2 hours, so the actual period of
read-only access will depend on the outcome of the eqiad row A switches
upgrade (https://phabricator.wikimedia.org/T329073) by the
Infrastructure Foundations team.
We will be pausing ingestion to the Data Lake a little ahead of this
time, so there will be a delay in dataset availability on HDFS,
Cassandra, and Druid etc.
Apologies for any inconvenience that this disruption to service will
cause you.
Please do let us know by reply to this list or in #wikimedia-analytics
on IRC if you have any queries, or would like to follow along with our
support of the maintenance work.
Kind regards,
Ben Tullis
--
*Ben Tullis*(he/him)
Senior Site Reliability Engineer
Wikimedia Foundation <https://wikimediafoundation.org/>