Hello,
Pretty much everyone who has dealt with creating views for new wikis on the
labs hosts has occasionally run into "Access denied" errors.
This was usually due to the MariaDB grant for the role being missing. We tried
to work around this by including the grant addition in the maintain-views
script.
Unfortunately, we ran into very weird problems when doing so; here is an
example: https://phabricator.wikimedia.org/T193187#4273281
After lots of back and forth we decided to file a bug with MariaDB (
https://jira.mariadb.org/browse/MDEV-16466), which MariaDB confirmed
yesterday, pointing to a similar issue (
https://jira.mariadb.org/browse/MDEV-14732).
The fix is expected to arrive in 10.4 (we are on 10.1), so it is quite a long
way ahead of us.
So, for now, the workaround before adding new views is to manually add the
GRANT on the DB and then run the script:
GRANT SELECT, SHOW VIEW ON `newiki\_p`.* TO 'labsdbuser';
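For clarity, the manual workaround can be sketched as a small shell snippet.
The wiki name, the mysql invocation and the maintain-views flags below are
illustrative, not authoritative:

```shell
#!/bin/sh
# Sketch of the manual workaround: add the grant first, then run the
# maintain-views script. The wiki name below is just an example.
wiki="newiki"

# Build the GRANT statement; the escaped underscore keeps it a literal
# match rather than a single-character wildcard in the db-name pattern.
grant_sql="GRANT SELECT, SHOW VIEW ON \`${wiki}\\_p\`.* TO 'labsdbuser';"
printf '%s\n' "$grant_sql"

# On the replica host one would then run something like (command and
# flags are illustrative):
#   mysql -e "$grant_sql"
#   sudo maintain-views --databases "${wiki}" --debug
```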
Hopefully with this email everyone is on the same page now.
Thanks everyone (especially Brooke for helping me out with the
troubleshooting!)
Manuel.
Hi,
If we poke holes in the firewall so Icinga can reach the VMs and we
define the monitoring::service stuff in Puppet, is that all we need to
shut down Shinken? Do you think there would be any concerns with going
that route?
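Since network access to Cloud VPS instances is controlled by OpenStack
security groups, the firewall side might look roughly like the following.
The security group name, the Icinga source address, and the port are
placeholders (5666 is simply the standard NRPE port), not the real values:

```shell
# Hypothetical example: allow the Icinga server to reach the monitoring
# agent port on VMs in the "default" security group. The remote IP and
# port are placeholders for illustration only.
openstack security group rule create default \
    --protocol tcp \
    --dst-port 5666 \
    --remote-ip 192.0.2.10/32
```

The monitoring::service definitions themselves would then live in Puppet
alongside the production checks.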
I'm asking about this because soon we'll see requests about removing
more and more Jessie support from the Puppet codebase [0] and there's no
exit strategy from Jessie for Shinken (it's not available in Stretch
unless we want to package things ourselves, which I tried and failed to do).
0 - https://gerrit.wikimedia.org/r/c/operations/puppet/+/491460
Thanks,
--
Giovanni Tirloni
Operations Engineer
Wikimedia Foundation
Hi,
Here is a brief update on the status of Toolforge and CloudVPS as of today,
2019-02-16, along with some rough estimates of what to expect in the following
days. Keeping track of all the events we had this week may be complex, because
there were several of them, heavily intermixed.
* CloudVPS suffered severe hardware issues this week [0]. We solved most of the
problems and added spare hardware [1] because our server capacity was
significantly reduced. The service should be mostly stable right now.
* Toolsdb (tools.db.svc.eqiad.wmflabs) is currently overloaded and suffering
from hardware errors. We are already working on a replacement for this service
[2]. Services depending on this database aren't working properly (like PAWS) and
Toolforge tools that use it are also affected.
An honest estimate is that services (especially Toolsdb) won't be fully
recovered until at least next Tuesday (2019-02-26).
Our current plans involve replacing the Toolsdb hardware with virtual machines
inside CloudVPS [3]. We are trying to be extra cautious to prevent data loss and
other problems usually associated with doing things in a rush.
Finally, I would like to mention that we are all well aware of the importance of
these services for the community and we are doing our best to get things fixed.
Thanks for your understanding and patience.
Regards,
[0] https://wikitech.wikimedia.org/wiki/Incident_documentation/20190213-cloudvps
[1] CloudVPS: drain and rebuild labvirt1009 as cloudvirt1009
https://phabricator.wikimedia.org/T216239
[2] ToolsDB overload and cleanup https://phabricator.wikimedia.org/T216208
[3] Replace labsdb100[4567] with instances on cloudvirt1019 and cloudvirt1020
https://phabricator.wikimedia.org/T193264
--
Arturo Borrero Gonzalez
Operations Engineer / Wikimedia Cloud Services
Wikimedia Foundation
Hi,
Due to a HW issue with the disks [0], cloudvirt1018 is currently shut down and
not responding in any capacity.
All VM instances that were running in this hypervisor are totally unreachable.
The count of affected VM instances [1] is 64.
[0] Degraded RAID on cloudvirt1018 https://phabricator.wikimedia.org/T216004
[1] cloudvps: evaluate draining cloudvirt1018
https://phabricator.wikimedia.org/T216030
--
Arturo Borrero Gonzalez
Operations Engineer / Wikimedia Cloud Services
Wikimedia Foundation
I have created a script to send out nag emails about moving things off
of the Trusty job grid. I wanted to share an example of its output with
y'all before I run the script for the first time, to get any feedback
you may have about better wording or other changes. To keep folks who
are running a lot of tools from being bombarded by messages, the script
will send a single email per Toolforge maintainer listing all of the
tools that the particular maintainer has access to that need to be
migrated.
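The per-maintainer grouping described above can be sketched in a few lines of
shell. The input format and the maintainer/tool pairs below are made up for
illustration; the real script presumably pulls this data from the grid
accounting logs and LDAP:

```shell
#!/bin/sh
# Sketch: collapse (maintainer, tool) pairs into one line per maintainer,
# so each person gets a single email listing all of their affected tools.
cat > pairs.txt <<'EOF'
bd808 jouncebot
bd808 stewardbots
example-user gridengine-status
EOF

result=$(awk '{
    if ($1 in tools) tools[$1] = tools[$1] ", " $2
    else tools[$1] = $2
} END {
    for (m in tools) print m ": " tools[m]
}' pairs.txt | sort)

printf '%s\n' "$result"
```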
Bryan
---------- Forwarded message ---------
From: Toolforge admins <tools.admin(a)tools.wmflabs.org>
Date: Wed, Feb 6, 2019 at 10:28 PM
Subject: [Toolforge] Tools you maintain are running on Trusty job grid
To: <bd808(a)tools.wmflabs.org>
Hello bd808,
This email is a reminder that the tools listed below have run jobs and/or
webservices using the Ubuntu Trusty job grid in the past 7 days. This job grid
will be shut down on or before the week of 2019-03-25 as the final step in the
removal of Ubuntu Trusty from Toolforge and the larger Cloud VPS environment.
* convert
* gridengine-status
* irc-wmt
* jouncebot
* meetbot
* my-first-flask-tool
* mysql-php-session-test
* stewardbots
* sulinfo
See <https://tools.wmflabs.org/trusty-tools/u/bd808> for more details
on these tools and the jobs that have been seen.
Please see the migration instructions on Wikitech [0] for more information on
how to move your tools to either the new Debian Stretch job grid or the
Kubernetes cluster.
[0]: https://wikitech.wikimedia.org/wiki/News/Toolforge_Trusty_deprecation
Thanks,
The Toolforge admin team
--
Bryan Davis Wikimedia Foundation <bd808(a)wikimedia.org>
[[m:User:BDavis_(WMF)]] Manager, Technical Engagement Boise, ID USA
irc: bd808 v:415.839.6885 x6855
Hi,
These emails are causing alert fatigue.
We've tweaked the thresholds high enough to make them rare, but they still
occur and we never take any action (in part because there's nothing
feasible to be done until we change our storage situation and/or most
workloads are migrated to Kubernetes, where we could implement better
controls).
I'd like to propose we disable these alerts for the time being and
re-evaluate our service level indicators when appropriate.
Giovanni Tirloni
Operations Engineer
Wikimedia Cloud Services
On Mon, Feb 4, 2019, 01:47 shinken <shinken(a)shinken-02.shinken.eqiad.wmflabs>
wrote:
> Notification Type: RECOVERY
>
> Service: High iowait
> Host: tools-exec-1419
> Address: 10.68.23.223
> State: OK
>
> Date/Time: Mon 04 Feb 03:46:59 UTC 2019
>
> Notes URLs:
>
> Additional Info:
>
> OK: All targets OK
>