Hello,
There will be a couple of brief interruptions to some of the Data Platform
services this Wednesday and Thursday, as we are supporting SRE
Infrastructure Foundations with some of their work to upgrade the
network switches in T348977 <https://phabricator.wikimedia.org/T348977>.
Specifically, we need to perform a role swap of our two Analytics_Meta
<https://wikitech.wikimedia.org/wiki/Data_Platform/Systems/Analytics_Meta>
database servers, which serve Hive, Druid, DataHub and Superset. The
roles will be swapped on Wednesday at around 10:00 UTC and swapped back
on Thursday at around 10:00 UTC. On each occasion, there will be a brief
period where the databases are made read-only, while the replication
roles are swapped and the application configuration is updated. This may
result in errors if you are actively using any of the applications at
the time.
In order to minimize the chance of data processing errors, I will also
be pausing ingestion to the data lake around 1 hour before each role
swap, so that data pipelines do not try to write to Hive or ingest to
Druid. Therefore you may also notice a delay for data to arrive in HDFS,
Hive, and Druid, but this shouldn't be more than an hour or so.
If you have any queries or concerns, please don't hesitate to let us
know by replying to this email, or on #data-engineering-collab on Slack, or
#wikimedia-analytics on IRC.
Kind regards,
Ben
--
*Ben Tullis* (he/him)
Senior Site Reliability Engineer
Wikimedia Foundation <https://wikimediafoundation.org/>
Good morning,
If you do not use our Archiva
<https://wikitech.wikimedia.org/wiki/Data_Engineering/Systems/Archiva>
artifact repository service, you may ignore this message.
Apologies for the short notice. This is just to let you know that I will
be performing some maintenance work on our Archiva server today, which
will result in some brief periods of downtime for the service. One
element of this work is a disk storage change operation and the next is
an O/S upgrade. I will try to keep the downtime of the service to a minimum.
Apologies if this instability causes you any inconvenience. Please do
feel free to let me know if this impacts your work and I will try to
help you find a workaround.
Kind regards,
Ben
Hello,
Just a quick reminder that this work to take stat100[4-7] out of service
will be going ahead next Tuesday morning, May 28th.
Please ensure that you have copied any important files to different stat
servers or to HDFS before then.
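For the HDFS option, a minimal sketch might look like the following. It assumes that the `hdfs` CLI is available on the stat server and that your HDFS home directory is `/user/<your shell username>`. The function names and example paths are hypothetical, so please adjust them to your own setup.

```python
import getpass
import subprocess
from pathlib import Path


def build_hdfs_put(local_path, hdfs_dir=None, user=None):
    """Return the argument list for copying local_path into HDFS."""
    user = user or getpass.getuser()
    # Assumption: HDFS home directories live under /user/<username>.
    hdfs_dir = hdfs_dir or f"/user/{user}"
    return ["hdfs", "dfs", "-put", str(Path(local_path).expanduser()), hdfs_dir]


def copy_to_hdfs(local_path, dry_run=True, **kwargs):
    """Print the command (dry run) or execute it; return the command either way."""
    cmd = build_hdfs_put(local_path, **kwargs)
    if dry_run:
        print(" ".join(cmd))
    else:
        # check=True raises CalledProcessError if the copy fails.
        subprocess.run(cmd, check=True)
    return cmd
```

Calling `copy_to_hdfs("~/projects/reports")` only prints the command; pass `dry_run=False` once the printed command looks right.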
Feel free to get in touch with us if you have any concerns about this
schedule.
Kind regards,
Ben
On 07/05/2024 12:47 pm, Ben Tullis wrote:
>
> Hello,
>
> We need to plan to decommission several of the analytics clients
> <https://wikitech.wikimedia.org/wiki/Analytics/Systems/Clients>, also
> referred to as the stats servers, since they have reached their end of
> service date. The servers in question are:
>
> * stat1004 <https://wikitech.wikimedia.org/wiki/Stat1004>
> * stat1005 <https://wikitech.wikimedia.org/wiki/Stat1005>
> * stat1006 <https://wikitech.wikimedia.org/wiki/Stat1006>
> * stat1007 <https://wikitech.wikimedia.org/wiki/Stat1007>
>
> If you actively use these servers, please consider moving your work to
> alternative stat servers (namely, stat1008 to stat1011) as soon as
> reasonably possible.
>
> Similarly, should you have personal files in your home directory on
> any of these servers that you would like to retain, now would be a
> good time to consider moving them to a different server, or moving
> them to your HDFS home directory.
>
> There are some guides available on syncing files between stats servers
> <https://wikitech.wikimedia.org/wiki/Analytics/Systems/Clients#Rsync_between…>
> and also using the hdfs CLI
> <https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster#How_do_I...>
> to manage files, which may help you to clean up the necessary files.
>
> We would like to be able to decommission these servers *three weeks
> from today*, which is on *Tuesday May 28th*. Please do feel free to
> get back to us if you feel that this timescale will not allow
> sufficient time for you to migrate your work to alternative servers,
> or if you have any other concerns about this plan.
>
> Kind regards,
> Ben
Hello,
I would like to upgrade stat1008 from buster to bullseye this Thursday
at approximately 09:15 UTC.
The upgrade is expected to take up to an hour, during which time
stat1008 will be unavailable for use. Work in your home directories will
be left untouched, so the impact should be low, especially if you are
using conda environments
<https://wikitech.wikimedia.org/wiki/Data_Engineering/Systems/Conda>.
If this maintenance window is likely to cause an issue for you, please
do let me know and I can look to reschedule the work. We will also be
available after the upgrade, in case you experience difficulties with
the upgraded operating system.
After the upgrade, stat1008 will have new SSH host fingerprints, so I
will update this page SSH_Fingerprints/stat1008.eqiad.wmnet
<https://wikitech.wikimedia.org/wiki/Help:SSH_Fingerprints/stat1008.eqiad.wm…>
and provide some more help to get you reconnected.
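In the meantime, one general OpenSSH approach (not necessarily the procedure that will be documented on that page) is to drop the stale entry from your local known_hosts file with `ssh-keygen -R`, then check the new fingerprint against the published one when you reconnect. The sketch below is only an illustration: the function names are hypothetical, and it assumes that `ssh-keygen` is installed on the machine you connect from.

```python
import subprocess


def build_forget_host_key(hostname, known_hosts=None):
    """Return the ssh-keygen argument list that removes hostname's stored keys."""
    cmd = ["ssh-keygen", "-R", hostname]
    if known_hosts:
        # -f selects a known_hosts file other than the default one.
        cmd += ["-f", known_hosts]
    return cmd


def forget_host_key(hostname, known_hosts=None, dry_run=True):
    """Print the command (dry run) or execute it; return the command either way."""
    cmd = build_forget_host_key(hostname, known_hosts)
    if dry_run:
        print(" ".join(cmd))
    else:
        subprocess.run(cmd, check=True)
    return cmd
```

For example, `forget_host_key("stat1008.eqiad.wmnet")` prints the command without changing anything.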
Kind regards,
Ben