> All of the errors occurred on writes to the user database.

That is strange. "enwiki.analytics.db.svc.eqiad.wmflabs" is "new": it is served by a new set of servers, was upgraded recently, and is still being tuned. toolsdb, on the other hand, has not been touched, I think, for a couple of weeks, since it was upgraded, and at the moment it is not handled by a proxy.

Do you use connection pooling or persistent connections? That is not allowed [2], but more importantly, it may create connection problems if a server fails over automatically, because the connection will keep pointing to the wrong server.
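
As a rough illustration of the alternative to persistent connections, here is a minimal Python sketch that opens a fresh connection for each unit of work, closes it afterwards, and reconnects if the work fails. The helper names (short_lived_connection, run_with_retry) and the parameters are made up for this example, and `connect` stands for whatever function your client library uses to open a connection, so treat it as a sketch under those assumptions rather than a recommended implementation.

```python
import contextlib
import time


@contextlib.contextmanager
def short_lived_connection(connect):
    """Open a fresh connection, hand it to the caller, and always close it.

    `connect` is a hypothetical zero-argument callable supplied by the
    caller. With PyMySQL it could be something like
    lambda: pymysql.connect(host=..., read_default_file=...).
    """
    conn = connect()  # a new connection each time, so the host name is resolved again
    try:
        yield conn
    finally:
        conn.close()  # never keep the connection around for later reuse


def run_with_retry(connect, work, attempts=3, delay=1.0, retry_on=(OSError,)):
    """Run work(conn) on a new connection, reconnecting if it fails.

    `retry_on` lists the exception types treated as connection problems.
    The default is only a placeholder; with PyMySQL one would probably
    pass (pymysql.err.OperationalError,) instead.
    """
    for attempt in range(1, attempts + 1):
        try:
            with short_lived_connection(connect) as conn:
                return work(conn)
        except retry_on:
            if attempt == attempts:
                raise  # give up and let the caller see the last error
            time.sleep(delay)  # short pause before opening a new connection
```

Because every attempt calls `connect` again, a retry after a failover should reach whichever server the service name currently points to, instead of reusing a connection to the old one.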

There was no overload on toolsdb last week that could explain the extra write load: [0]. There was one overload, however, on labsdb1009 (analytics) during the weekend, which led to me banning/throttling and notifying several users, as they had created a denial of service: [1]

Notice that one big change on the new servers (analytics and web) is that right now there is no query limitation: if some user runs 10 long-running queries, they can, and that could affect other users. I have not limited that except to deal with issues caused by individual users. If the community wants to agree on some limits, I can set them up with no problem, but now that we are not so resource-bound I did not want to introduce artificial limitations, as some people did not like the limits that the old servers needed because of their lower resources.

[0] <https://grafana.wikimedia.org/dashboard/file/server-board.json?refresh=1m&orgId=1&var-server=labsdb1005&var-network=eth0&from=now-7d&to=now>
[1] <https://phabricator.wikimedia.org/T182997>
[2] <https://wikitech.wikimedia.org/wiki/Help:Toolforge/Database#Connection_handling_policy>

This sounds like something that is worth opening a Phabricator task
about. We do have an existing ticket
(<https://phabricator.wikimedia.org/T180380>) that may also be somehow
related depending on where the disconnects are happening.


Please share the details of the connection (user, code, timestamps) on a Phabricator task; maybe there is a slowdown on toolsdb that we have not yet noticed. That way we can have a deeper look.
--
Jaime Crespo
<http://wikimedia.org>