Next Friday we'll be upgrading our OpenStack cluster. The upgrade
should not interrupt any existing tools or instances, but during the
upgrade it will be impossible to create, delete, or modify WMCS VMs.
I'll start the process at around 02:00 UTC (7AM PDT). The complete
upgrade may take much of the day; Horizon will be disabled for the
duration of this update. CI/Nodepool will function intermittently
during this time.
I'll send another notice about this as the date approaches.
-Andrew
Dear Wikimedia Cloud VPS/Toolforge users,
If you use dumps/pageviews/other datasets available via /public/dumps on
a Toolforge/VPS instance, this email is for you!
The underlying NFS storage server is being replaced. Your access path will
remain the same, but the data may be stale or inaccessible during the
migration to the new servers.
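
For tools that read from /public/dumps on a schedule, a defensive check
before each run may help during the migration window. The sketch below
is only an illustration: the helper name `dataset_status` and the idea
of using a file's modification time as the freshness signal are
assumptions, not part of the migration plan.

```python
import os
import time


def dataset_status(path, max_age_seconds, now=None):
    """Classify a dataset file as 'inaccessible', 'stale', or 'fresh'.

    Hypothetical helper: the file's mtime is assumed to be a usable
    freshness signal for the dataset in question.
    """
    now = time.time() if now is None else now
    try:
        mtime = os.stat(path).st_mtime
    except OSError:
        # Covers a missing file as well as an unreachable NFS mount.
        return "inaccessible"
    if now - mtime > max_age_seconds:
        return "stale"
    return "fresh"
```

A job could skip its run, or log a warning, whenever the result is not
"fresh" for the file it depends on.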
Dates: The migration is scheduled for *April 2nd, 2018 starting at 14:30
UTC*, and is expected to last a few hours.
Thanks! We'll send more updates closer to the migration date. If you have
any questions, just let us know.
Best,
--
Madhumitha Viswanathan & Ariel Glenn
About 24 hours from now we're going to reboot a couple of servers[1] in
the cloud infrastructure to apply security updates.
Few WMCS users (and, in particular, no tools users) should notice any
interruption. Nonetheless, a few services will be down:
- New instance creation will fail
- CI tests will stop running
- Horizon and Wikitech may display incorrect or missing information
Apologies in advance for any inconvenience!
-Andrew
[1] labservices1001 and labcontrol1001
On Friday morning my time (10:00 CST, 8:00 PST, 16:00 UTC) I'll be
switching the DNS record for wikitech.wikimedia.org to point to a new
server. This change should be largely invisible to users, but there are
a few things to be ready for:
- Most importantly, YOU WILL BE LOGGED OUT of Wikitech. So if you've
been relying on a persistent session to avoid having to keep track of
your 2FA tokens, today is the day to reset 2FA and record the new
information.
- The new Wikitech build uses a lot of updated software (Debian Stretch,
HHVM, etc.). Although we've done some spot-checks, there may be new
issues that appear for your particular use case. We'll evaluate these
as they crop up, and decide whether or not to revert to the old server.
-Andrew
We lost a KVM host at around 7:20 UTC. Because we use local storage for
instances, a number of them are down. Toolforge suffered a few losses,
but few enough that GridEngine and Kubernetes users appear to be
unaffected at this time. The initial task is T187292
(with a list of instances), and an incident report will follow. We hope to
recover all of the instances that are down but it will take time to sort
through.
--
Chase Pettet
chasemp on phabricator <https://phabricator.wikimedia.org/p/chasemp/> and
IRC
We have completed all of the updates and reboots for the hypervisors and
instances in https://phabricator.wikimedia.org/T184910, but there are more
maintenance events that are less invasive to come. This is being tracked
in https://phabricator.wikimedia.org/T184910
Most of this will be handled gracefully without user impact, but not all of
it.
We will reboot the `dumps` NFS server, which also provides the `maps` and
`scratch` NFS shares, tomorrow (2018-01-18). This is an outage event
because this server is a single point of failure. Efforts to improve this
are happening in https://phabricator.wikimedia.org/T168486.
More announcements will come for maintenance that is impactful.
--
Chase Pettet
chasemp on phabricator <https://phabricator.wikimedia.org/p/chasemp/> and
IRC
Sometime soon (probably in the next day or two) we will be applying
kernel patches to all VMs and physical hosts in WMCS. This is to address
an urgent security issue[1], so we'll be skipping the traditional 7-day
warning period -- basically as soon as proper fixes are available we'll
start patching and rebooting.
As usual, we'll do our best to re-balance Toolforge grid nodes, so
impact on Toolforge users should be minimal (worst case you may need to
manually restart interrupted tasks).
For other users: if your VPS project requires special handling or
specific notice about when a particular VM will reboot, please add a
subtask describing your need to https://phabricator.wikimedia.org/T184189 .
[1] https://en.wikipedia.org/wiki/Meltdown_(security_vulnerability)
On 2018-01-09 the wikidatawiki database will move from its current
home on the "s5" slice to a brand new "s8" slice. The
wikidatawiki.{analytics,web}.db.svc.eqiad.wmflabs and
wikidatawiki.labsdb DNS service names will be updated to point to the
new slice host by system administrators. This change should not affect
most users of the Wiki Replica servers.
Only applications that are connecting to
s5.{analytics,web}.db.svc.eqiad.wmflabs or s5.labsdb and expecting the
wikidatawiki_p database to be present will be affected. These
applications should update their configuration to connect to the new
"s8" slice instead.
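
For tools that build their connection host in code, one way to avoid
depending on a slice name at all is to derive the host from the wiki
database name. This is a minimal sketch, not an official client: the
helper name `replica_host` is hypothetical, and only the host name
pattern is taken from this announcement.

```python
VALID_CLUSTERS = ("analytics", "web")


def replica_host(name, cluster="analytics"):
    """Build a Wiki Replica service name.

    `name` is a wiki database name (e.g. "wikidatawiki") or a slice
    name (e.g. "s8"). Hypothetical helper for illustration only.
    """
    if cluster not in VALID_CLUSTERS:
        raise ValueError("cluster must be one of %r" % (VALID_CLUSTERS,))
    return "{}.{}.db.svc.eqiad.wmflabs".format(name, cluster)


# Addressing the wiki by name keeps working after the move to "s8".
print(replica_host("wikidatawiki"))
```

A tool that must address the slice directly would pass "s8" instead of
"s5" as the first argument.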
This is the end point (or nearly so) of a large amount of work that
has been done by Wikimedia's fabulous DBA team of Jaime and Manuel to
improve the health of the Wikidata wiki. See
<https://phabricator.wikimedia.org/T177208> for more details.
Bryan
--
Bryan Davis Wikimedia Foundation <bd808(a)wikimedia.org>
[[m:User:BDavis_(WMF)]] Manager, Cloud Services Boise, ID USA
irc: bd808 v:415.839.6885 x6855
The labsdb1003.eqiad.wmnet (aka c3.labsdb) server is no longer serving
*.labsdb requests.
The c3.labsdb service name will continue to point to the
labsdb1003.eqiad.wmnet server for the near future, but replication
will soon stop there and all tables will be made read-only.
User databases on c1.labsdb and c3.labsdb listed at
https://tools.wmflabs.org/tool-db-usage/ will be going away on
2018-01-03. You will need to migrate these to
tools.db.svc.eqiad.wmflabs if you need to save the data.
TL;DR
* Change your tools and scripts to use:
- "*.web.db.svc.eqiad.wmflabs" (real-time response needed)
- "*.analytics.db.svc.eqiad.wmflabs" (batch jobs; long queries)
* Replace "*" with either a shard name (e.g. s1) or a wikidb name
(e.g. enwiki).
* The new servers do not support user created databases/tables because
replication can't be guaranteed. See T156869 and below for more
information.
* Migrate your user created tables to tools.db.svc.eqiad.wmflabs
(also known as tools.labsdb) and JOIN via application space logic
rather than in-process in the database.
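
As an illustration of the last point, a JOIN in application space can be
as simple as indexing one result set by the join key and looking up rows
from the other. This is a sketch under assumptions: the function name
`join_in_application` is hypothetical, and each input is assumed to be a
list of row dictionaries already fetched from its own server (one from a
Wiki Replica, one from tools.db.svc.eqiad.wmflabs).

```python
def join_in_application(replica_rows, tool_rows, key):
    """Inner-join two lists of row dicts on `key` in application code."""
    # Index the user-table rows by join key; a key may match many rows.
    index = {}
    for row in tool_rows:
        index.setdefault(row[key], []).append(row)

    joined = []
    for left in replica_rows:
        for right in index.get(left[key], []):
            merged = dict(left)
            merged.update(right)  # on a column clash, tool_rows wins
            joined.append(merged)
    return joined
```

For large result sets it is usually better to fetch the keys from the
smaller table first and restrict the second query to those keys, rather
than pulling both tables in full.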
What is changing?
* Wednesday 2017-12-13
** "*.labsdb" service names switched to point at
"*.web.db.svc.eqiad.wmflabs" equivalents.
** User created tables will not be allowed on the new servers.
** "c3.labsdb" still points at labsdb1003.eqiad.wmnet
* Thursday 2017-12-14
** DBAs will stop replication from production hosts to labsdb1003.eqiad.wmnet
** DBAs will make databases on labsdb1003.eqiad.wmnet read-only for all users
* Wednesday 2018-01-03
** labsdb1001.eqiad.wmnet (aka c1.labsdb) will be shut down permanently
** labsdb1003.eqiad.wmnet (aka c3.labsdb) will be shut down permanently
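
For anyone updating many config files at once, the name changes in the
timeline above can be expressed as a small translation function. This is
a sketch only: `modern_service_name` is a hypothetical helper, and the
mapping it encodes is limited to what this email states.

```python
LEGACY_SUFFIX = ".labsdb"


def modern_service_name(legacy, cluster="web"):
    """Translate a legacy "*.labsdb" name to its new service name.

    Hypothetical helper. `cluster` is "web" or "analytics"; names that
    do not end in ".labsdb" are returned unchanged.
    """
    if not legacy.endswith(LEGACY_SUFFIX):
        return legacy
    prefix = legacy[: -len(LEGACY_SUFFIX)]
    if prefix == "tools":
        # User-created databases live on a separate service.
        return "tools.db.svc.eqiad.wmflabs"
    if prefix in ("c1", "c3"):
        # These servers are being shut down; there is no new equivalent.
        raise ValueError(legacy + " is going away; migrate its data")
    return "{}.{}.db.svc.eqiad.wmflabs".format(prefix, cluster)
```

Raising an error for c1 and c3 is a deliberate choice here, so that a
bulk rewrite cannot silently point a tool at a server that lacks its
user databases.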
Why are we doing this?
See <https://wikitech.wikimedia.org/wiki/Wiki_Replica_c1_and_c3_shutdown>
and <https://phabricator.wikimedia.org/T142807> for a more complete
description of the reasons for these changes.
Bryan (on behalf of the Wikimedia Cloud Services and DBA teams)
--
Bryan Davis Wikimedia Foundation <bd808(a)wikimedia.org>
[[m:User:BDavis_(WMF)]] Manager, Cloud Services Boise, ID USA
irc: bd808 v:415.839.6885 x6855
Hello all,
Some tools running on the Toolforge Kubernetes cluster are currently
suffering from network failures. The cause is not yet fully diagnosed,
although we have some ideas as to how to at least reduce the impact. The
tracking bug is https://phabricator.wikimedia.org/T182722.
We'll send another update when we have more information and/or when
things are resolved; in the meantime no action is required on your part
as we'll most likely restart affected tools and services ourselves as
part of fixing the problem.
Sorry for the downtime!
-Andrew + the WMCS team