During the flurry of activity we had recently in diagnosing and fixing
problems with the shared ToolsDB MariaDB service [0], we made a
configuration change to place a hard limit on the maximum number of
simultaneous connections permitted for each user account [1][2].
The current limit is set at 20 concurrent connections. This should not
cause any problems for a typical webservice or single script using
ToolsDB, but tools making heavy use of ToolsDB may need to make some
adjustments.
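For tools that open connections from many threads, one possible adjustment is to gate connection creation behind a semaphore so that excess work waits locally instead of being refused by the server. The following is a minimal Python sketch under stated assumptions, not an official Toolforge recipe: `ConnectionGate`, its `connect` argument, and the headroom of two connections below the limit are hypothetical choices made for illustration.

```python
import threading
from contextlib import contextmanager

# Per-user limit stated in this announcement.
TOOLSDB_MAX_USER_CONNECTIONS = 20


class ConnectionGate:
    """Cap how many connections one process holds open at the same time."""

    def __init__(self, connect, limit=TOOLSDB_MAX_USER_CONNECTIONS - 2):
        # `connect` is a hypothetical zero-argument factory that returns an
        # object with a close() method (for example a wrapped DB driver call).
        self._connect = connect
        # Assumption: keep a little headroom below the server-side limit.
        self._slots = threading.BoundedSemaphore(limit)

    @contextmanager
    def connection(self):
        # Block until a slot is free rather than failing at the server.
        with self._slots:
            conn = self._connect()
            try:
                yield conn
            finally:
                # Always release the server-side connection promptly.
                conn.close()
```

A worker would then wrap each unit of database work in `with gate.connection() as conn:`. Note that the semaphore only bounds connections within a single process; a tool that runs several processes or grid jobs under the same account would need to divide the budget between them.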
As always, tool maintainers can seek advice on dealing with this limit
or other issues in Toolforge from the Toolforge administration team
and others in the community via our Freenode IRC channel
(#wikimedia-cloud), Phabricator tasks, and the
cloud(a)lists.wikimedia.org mailing list.
[0]: https://phabricator.wikimedia.org/T216208
[1]: https://phabricator.wikimedia.org/T216170
[2]: https://mariadb.com/kb/en/library/server-system-variables/#max_user_connect…
Bryan, on behalf of the Toolforge administration team
--
Bryan Davis Wikimedia Foundation <bd808(a)wikimedia.org>
[[m:User:BDavis_(WMF)]] Manager, Technical Engagement Boise, ID USA
irc: bd808 v:415.839.6885 x6855
This is an update on the ongoing problems with the toolsdb service. We are preparing to move to a new server, which is now a functioning replica of the toolsdb server. The first step is to restart the service in read-only mode, and then we will move the DNS. Expect writes to stop working and connections to drop. Once the DNS has moved to the new server, services that use this database will need to be restarted.
This will happen within the next hour unless unexpected issues or extra caution slow us down.
Brooke Storm
Operations Engineer
Wikimedia Cloud Services
bstorm(a)wikimedia.org
IRC: bstorm_
Hi,
Here is a brief update on the status of Toolforge and CloudVPS as of today,
2019-02-16, along with some rough estimates of what to expect in the following
days. Keeping track of all the events we had this week may be difficult, because
there were several of them and they were heavily intermixed.
* CloudVPS suffered severe hardware issues this week [0]. We solved most of the
problems and added spare hardware [1] because our server capacity had been
significantly reduced. This service should be mostly stable right now.
* Toolsdb (tools.db.svc.eqiad.wmflabs) is currently overloaded and suffering
from hardware errors. We are already working on a replacement for this service
[2]. Services depending on this database aren't working properly (like PAWS) and
Toolforge tools that use it are also affected.
An honest estimate is that services (especially Toolsdb) will not be fully
recovered until at least next Tuesday (2019-02-26).
Our current plans involve replacing the Toolsdb hardware with virtual machines
inside CloudVPS [3]. We are trying to be extra cautious to prevent data loss and
other problems usually associated with doing things in a rush.
Finally, I would like to mention that we are all well aware of the importance of
these services for the community and we are doing our best to get things fixed.
Thanks for your understanding and patience.
regards
[0] https://wikitech.wikimedia.org/wiki/Incident_documentation/20190213-cloudvps
[1] CloudVPS: drain and rebuild labvirt1009 as cloudvirt1009
https://phabricator.wikimedia.org/T216239
[2] ToolsDB overload and cleanup https://phabricator.wikimedia.org/T216208
[3] Replace labsdb100[4567] with instances on cloudvirt1019 and cloudvirt1020
https://phabricator.wikimedia.org/T193264
--
Arturo Borrero Gonzalez
Operations Engineer / Wikimedia Cloud Services
Wikimedia Foundation
Because bad things come in threes (I'm hoping it's threes and not
sevens), the server that hosts toolsdb is now also misbehaving. Brooke
just now disabled a troubled drive which may have resolved things, but
if the last few hours are any indication then the vast majority of
connection or query attempts are likely to fail until we have a better
solution in place.
We're working on multiple fronts, trying to diagnose and fix the primary
issue while also working to get new hardware online as a possible
replacement server. Neither of those things is likely to get done
until tomorrow, though, so Toolforge will be in pretty bad shape in the
meantime.
It has been a rough couple of days, but rest assured we're taking notes
about how to prevent outages like these in the future. Thank you for
your patience in the meantime!
-Andrew + the cloud team
Today we have deployed an updated version of the webservicemonitor
service that we use to help ensure that `webservice
--backend=gridengine ...` processes are actively running on the job
grid. The main change in this new version is that we have implemented
tracking of the timestamp of past restart attempts for each tool and a
restart rate limit. The initial limit we have set for this is 3
restarts per 60 minute sliding window.
This change will not stop a tool maintainer from running `webservice
restart` manually. You can read more of the reasoning behind the
change at <https://phabricator.wikimedia.org/T107878>.
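To illustrate what a restart rate limit over a sliding window means in practice, here is a minimal Python sketch. It is not the webservicemonitor implementation: `RestartLimiter` and `allow_restart` are hypothetical names, and only the numbers (3 restarts per 60 minutes) come from this announcement.

```python
import time
from collections import defaultdict, deque

MAX_RESTARTS = 3          # limit stated in this announcement
WINDOW_SECONDS = 60 * 60  # 60 minute sliding window


class RestartLimiter:
    """Track past restart timestamps per tool and enforce a rate limit."""

    def __init__(self, max_restarts=MAX_RESTARTS, window=WINDOW_SECONDS):
        self.max_restarts = max_restarts
        self.window = window
        # Maps tool name -> timestamps of restarts still inside the window.
        self._history = defaultdict(deque)

    def allow_restart(self, tool, now=None):
        """Return True and record the attempt if the tool is under its limit."""
        now = time.time() if now is None else now
        stamps = self._history[tool]
        # Drop attempts that have slid out of the window.
        while stamps and now - stamps[0] >= self.window:
            stamps.popleft()
        if len(stamps) >= self.max_restarts:
            return False
        stamps.append(now)
        return True
```

With these settings a tool restarted at minutes 0, 10, and 20 would be refused at minute 30 and allowed again at minute 60, once the first attempt has left the window. Each tool is tracked separately.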
Bryan
--
Bryan Davis Wikimedia Foundation <bd808(a)wikimedia.org>
[[m:User:BDavis_(WMF)]] Manager, Technical Engagement Boise, ID USA
irc: bd808 v:415.839.6885 x6855
Per our announcement on 2019-01-11 [0], the Ubuntu Trusty job grid is
deprecated and jobs running there need to be manually moved to either
the replacement Debian Stretch job grid or the Toolforge Kubernetes
cluster.
Starting today (2019-02-07) the maintainers of tools that have been
seen to run jobs on the Trusty grid will receive email notices with
the subject "[Toolforge] Tools you maintain are running on Trusty job
grid". These reminders will go out each Thursday through 2019-03-07.
At that point we will increase the frequency of reminders to daily
until we reach the Trusty job grid shutdown date in the last week of
March.
Please see the Toolforge Trusty deprecation news page on wikitech [1]
for more information on how to migrate your grid jobs and things to
watch out for in the process.
== What is changing? ==
* New job grid running Son of Grid Engine on Debian Stretch instances
* New limits on concurrent job execution and job submission by a single tool
* New bastion hosts running Debian Stretch with connectivity to the new job grid
* New versions of PHP, Python2, Python3, and other language runtimes
* New versions of various support libraries
== What should I do? ==
The Cloud Services team has created the Toolforge Trusty
deprecation[1] page on wikitech.wikimedia.org to document basic steps
needed to move webservices, cron jobs, and continuous jobs from the
old Trusty grid to the new Stretch grid. That page also provides more
details on the language runtime and library version changes and will
provide answers to common problems people encounter as we find them.
If the answer to your problem isn't on the wiki, ask for help in the
#wikimedia-cloud IRC channel or file a bug in Phabricator[2].
[0]: https://lists.wikimedia.org/pipermail/cloud-announce/2019-January/000122.ht…
[1]: https://wikitech.wikimedia.org/wiki/News/Toolforge_Trusty_deprecation
[2]: https://phabricator.wikimedia.org/maniphest/task/edit/form/1/?title=Stretch…
Thanks,
Bryan (on behalf of the Toolforge administration team)
--
Bryan Davis Wikimedia Foundation <bd808(a)wikimedia.org>
[[m:User:BDavis_(WMF)]] Manager, Technical Engagement Boise, ID USA
irc: bd808 v:415.839.6885 x6855