Every year or so the Cloud Services team tries to identify and clean up
unused projects and VMs. We do this via an opt-in process: anyone can
mark a project as 'in use,' and that project will be preserved for
another year.
I've created a wiki page that lists all existing projects, here:
https://wikitech.wikimedia.org/wiki/News/Cloud_VPS_2019_Purge
If you are a VPS user, please visit that page and mark any projects that
you use as {{Used}}. Note that it's not necessary for you to be a
project admin to mark something -- if you know that you're currently
using a resource and want to keep using it, go ahead and mark it
accordingly. If you /are/ a project admin, please take a moment to mark
which VMs are or aren't used in your projects.
When December arrives, I will shut down unused projects and begin the
process of reclaiming their resources.
If you think you use a VPS project but aren't sure which, I encourage
you to poke around on https://tools.wmflabs.org/openstack-browser/ to
see what looks familiar. Worst case, just email
cloud(a)lists.wikimedia.org with a description of your use case and we'll
sort it out there.
If you only use Toolforge, you are free to ignore this task.
Thank you!
-Andrew and WMCS team
_______________________________________________
Wikimedia Cloud Services announce mailing list
Cloud-announce(a)lists.wikimedia.org (formerly labs-announce(a)lists.wikimedia.org)
https://lists.wikimedia.org/mailman/listinfo/cloud-announce
Hi there!
Next Monday 2019-10-28 @ 14:30 UTC we will do a maintenance operation on
Toolforge which consists of rebuilding the main front proxy [0] used to serve
webservices. We expect this to be done within a 30-minute window.
The operation consists of replacing the old virtual machines supporting the
proxy (currently running Debian Jessie) with more modern instances running
Debian Buster. Both the Grid and Kubernetes backends are affected by this
change. We don't expect much service downtime, but one key step in the
operation, migrating the data stored in Redis, can be tricky.
Examples of things affected by this change:
* Browsing Toolforge webservices
* Browsing to https://tools.wmflabs.org/<toolname>
* Browsing to https://tools.wmflabs.org/admin/ (Toolforge landing page)
* Browsing PAWS (to some extent, since it shares part of the Toolforge proxy)
Examples of things not affected by this change:
* webservices backend operations
* SSH bastions
* grid queues, grid jobs
* wiki-replicas, toolsdb
* other CloudVPS projects
regards.
[0] https://phabricator.wikimedia.org/T235627
--
Arturo Borrero Gonzalez
SRE / Wikimedia Cloud Services
Wikimedia Foundation
A redundant power supply upgrade going on this week in the datacenter could affect the VM that Toolsdb runs on. To protect data in case anything goes wrong, we anticipate a brief outage of the mysql service on Thursday 10/24 @ 11am UTC. You may need to restart your tool afterwards so that it reconnects to the database. We do not anticipate any worse disruption, but if there is any disruption beyond what is planned, a failover may be necessary. A failover will not include the non-replicated tables mentioned here: https://wikitech.wikimedia.org/wiki/Help:Toolforge/Database#ToolsDB_Backups…
The maintenance requiring this notice and action is detailed here: https://phabricator.wikimedia.org/T227540. The VM resides on the cloudvirt1019 hypervisor, which is why it is in scope.
We sincerely apologize for the short notice.
Brooke Storm
Senior SRE
Wikimedia Cloud Services
bstorm(a)wikimedia.org
IRC: bstorm_
Effective immediately, Toolforge's webservices (re-)started by the webservice
command will no longer produce a $HOME/access.log file by default.
This feature can easily be re-enabled if required for your tool. To do so,
please follow the instructions posted at https://w.wiki/9go
Since not everyone requires the access.log feature, we have decided that it
makes more sense to have it disabled by default. We believe that this change
will improve the overall Toolforge experience: it frees up disk space and
also the CPU cycles that the web servers spend producing the access.log files.
If you see odd behaviour when starting or restarting a webservice that looks
like it could be related to this change, please let me or one of the
Toolforge admins know, either by filing a Phabricator bug report or, for a
faster response, by joining the #wikimedia-cloud IRC channel on Freenode and
sending a "!help ...." message to the channel explaining your issue.
Hieu Pham - on behalf of the Toolforge admin team
TL;DR: All webservices running on the grid engine backend in Toolforge
were restarted around 2019-10-18 21:29 UTC. Following the restart,
these jobs should retain the ability to write to their original
TMPDIR.
Earlier this week Musikanimal commented on a stale ticket [0] about a
mysteriously intermittent "(chunk.c.553) opening temp-file failed: No
such file or directory" error in a particular webservice. A related
bug [1] (now merged into the first as a duplicate) had been looked at
in depth previously by Zhuyifei1999 with no clear conclusion. I
started looking into the problem with little expectation of finding an
answer, but a hope that I could at least rule some things out as the
"root cause".
I got lucky this time and did figure out a root cause for the problem.
It turns out that Grid Engine creates a unique directory under /tmp
for each job that is started. This directory is named /tmp/{job
number}.{task number}.{queue name}. The job's main process is started
with the TMPDIR environment variable pointing to this unique
directory. Separately, we have a daily cron task which runs on each
Grid Engine exec node marked as a part of the webgrid-generic or
webgrid-lighttpd job queues to remove files and empty directories
under /tmp which have not been accessed in more than 24 hours. This
cleanup task was deleting the empty TMPDIR of jobs which had not
written to or read from their TMPDIR in more than 24 hours. Once I
made this connection, the fix was as simple as configuring the cleanup
task to ignore empty directories that look like the TMPDIR pattern
used by Grid Engine.
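The exclusion rule described above can be sketched roughly as follows. This is an illustration only, not the cleanup task that actually runs on the exec nodes: the function names, the regular expression, and the choice to look only at directories directly under the root are all assumptions made for the example.

```python
import os
import re
import time

# Assumed shape of a Grid Engine job TMPDIR name:
# {job number}.{task number}.{queue name}, e.g. "1234567.1.webgrid-lighttpd"
GRID_TMPDIR_RE = re.compile(r"^\d+\.\d+\.[\w.-]+$")

MAX_AGE_SECONDS = 24 * 60 * 60  # "not accessed in more than 24 hours"


def looks_like_grid_tmpdir(name):
    """Return True if a directory name matches the assumed TMPDIR pattern."""
    return GRID_TMPDIR_RE.match(name) is not None


def stale_empty_dirs(root, now=None, max_age=MAX_AGE_SECONDS):
    """List empty directories directly under root that are candidates for
    removal, skipping anything that looks like a Grid Engine TMPDIR."""
    now = time.time() if now is None else now
    candidates = []
    with os.scandir(root) as entries:
        for entry in entries:
            if not entry.is_dir(follow_symlinks=False):
                continue
            if looks_like_grid_tmpdir(entry.name):
                continue  # never remove a job's TMPDIR, even if idle
            # Read the access time before listing the directory, because
            # listing it may itself update the access time.
            atime = entry.stat(follow_symlinks=False).st_atime
            if now - atime <= max_age:
                continue
            if os.listdir(entry.path):
                continue  # not empty
            candidates.append(entry.path)
    return sorted(candidates)
```

The function only reports candidates rather than deleting them, which keeps the sketch safe to try against a scratch directory.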
After the configuration change was deployed, I set up a temporary
webservice to monitor its own TMPDIR to verify that it was indeed
fixed. Earlier today that tool crossed the 48-hour worst-case runtime
I had calculated with no recurrence of the error. With that
confirmation of the fix, I decided to restart all of the webservice
jobs running on the grid engine in Toolforge to ensure that they have
a TMPDIR created. This seemed like a better solution than just
emailing the cloud-announce list to tell folks to restart their
webservices if they were likely to be affected.
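For tool maintainers who want to check for this failure mode themselves, a minimal sketch follows. It is not the temporary monitoring webservice mentioned above; the function name is hypothetical, and it only assumes that the job was started with a TMPDIR environment variable as described earlier.

```python
import os
import tempfile


def tmpdir_is_usable(environ=None):
    """Return True if TMPDIR is set, still exists, and a temporary file
    can be created inside it."""
    environ = os.environ if environ is None else environ
    tmpdir = environ.get("TMPDIR")
    if not tmpdir or not os.path.isdir(tmpdir):
        return False
    try:
        # The file is removed automatically when the block exits.
        with tempfile.TemporaryFile(dir=tmpdir):
            pass
    except OSError:
        return False
    return True
```

A tool could call tmpdir_is_usable() periodically, or from a health check, and log a warning when it returns False.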
The process I went through in debugging is well documented on the task
[2]. The notes there do not include all the web searches I did for
various error messages and documentation of FOSS software involved in
the webservice, but they do pretty clearly show that I started out
looking in one place and ended up figuring out the root cause was
something completely different. The final analysis also shows how
fixing one problem [3] can unintentionally lead to new problems.
[0]: https://phabricator.wikimedia.org/T217815
[1]: https://phabricator.wikimedia.org/T225966
[2]: https://phabricator.wikimedia.org/T217815#5577987
[3]: https://phabricator.wikimedia.org/T190185
Bryan
--
Bryan Davis Technical Engagement Wikimedia Foundation
Principal Software Engineer Boise, ID USA
[[m:User:BDavis_(WMF)]] irc: bd808
The <wikidb>.analytics.db.svc.eqiad.wmflabs database servers have been
experiencing some stability issues in the last two to three weeks that
we have reason to believe are related to query volume. The DBA team at
the Wikimedia Foundation is looking into various changes that may help
with these problems including software upgrades for our MariaDB
deployments.
Today we took an initial step of reducing the maximum time allowed for
a query to complete on the <wikidb>.analytics.db.svc.eqiad.wmflabs
hosts to 1 hour. We were using an upper limit of 4 hours previously.
Our hope is that this change will relieve some stress on the shared
servers and allow us more time to look into other changes to restore
stability. Ideally we will be able to increase the limit again after
making other changes to these systems.
Bryan, on behalf of the Cloud Services team
--
Bryan Davis Technical Engagement Wikimedia Foundation
Principal Software Engineer Boise, ID USA
[[m:User:BDavis_(WMF)]] irc: bd808
We'll be upgrading the cloud services OpenStack install on Monday,
beginning at 14:00 UTC.
The entire upgrade process may take a couple of hours. Early on in the
process, Horizon (and associated OpenStack APIs) will be disabled,
probably for 20 to 30 minutes. There may also be brief network
interruptions during the upgrade, although if all goes well these will
not be noticeable by users.
Toolforge and existing VMs should be largely unaffected apart from
possible network hiccups.
- Andrew + the WMCS team