One of tools.dplbot's daily tasks has been having repeated problems since yesterday. A script that ran without errors and completed in about 10 minutes on Friday ran for over 90 minutes on Saturday, and died with a "MySQL server has gone away" error. There were no edits to the script between Friday and Saturday, so I have to assume that something changed on the server side.
The script reads from enwiki.analytics.db.svc.eqiad.wmflabs, and both reads from and writes to tools.labsdb. All of the errors occurred on writes to the user database. I was able to work around the errors by dropping the database connection and opening a new one immediately before writing (I have no idea why this works, since the timeout setting on the database for inactive connections is 8 hours, and this script was not even running for two hours; but it did work). However, the script continues to run for an order of magnitude longer than it did on Friday (~100 minutes vs. ~10 minutes). Is anyone else experiencing similar issues?
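For anyone who wants to try the same workaround, here is a minimal sketch of "open a new connection immediately before writing". It is not the actual dplbot code. The function name `write_with_fresh_connection` and the `connect` factory are hypothetical, and the only assumption about the driver is that it follows the Python DB-API (`cursor()`, `executemany()`, `commit()`, `close()`).

```python
from contextlib import closing


def write_with_fresh_connection(connect, sql, rows):
    """Open a new connection, write all rows, commit, and close it again.

    `connect` is a hypothetical zero-argument callable that returns a
    DB-API connection (for example, a lambda wrapping the driver's
    connect function with the tool's credentials).
    """
    # closing() guarantees conn.close() runs even if the write fails,
    # so no idle connection is left open between runs.
    with closing(connect()) as conn:
        cur = conn.cursor()
        try:
            cur.executemany(sql, rows)
            conn.commit()
        finally:
            cur.close()
```

The placeholder style in `sql` depends on the driver (`%s` for the common MySQL drivers, `?` for sqlite3), so the statement has to match whatever `connect` returns.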
On Sun, Dec 17, 2017 at 9:44 AM, Russell Blau <russblau@imapmail.org> wrote:
> One of tools.dplbot's daily tasks has been having repeated problems since yesterday. A script that ran without errors and completed in about 10 minutes on Friday ran for over 90 minutes on Saturday, and died with a "MySQL server has gone away" error. There were no edits to the script in between Friday and Saturday, so I have to assume that something changed on the server side.
> The script reads from enwiki.analytics.db.svc.eqiad.wmflabs, and both reads from and writes to tools.labsdb. All of the errors occurred on writes to the user database. I was able to work around the errors by dropping the database connection and opening a new one immediately before writing (I have no idea why this works, since the timeout setting on the database for inactive connections is 8 hours, and this script was not even running for two hours; but it did work). However, the script continues to run for an order of magnitude longer than it did on Friday (~100 minutes vs. ~10 minutes). Is anyone else experiencing similar issues?
Can you determine if the increased runtime is from reading data from the enwiki side or from writing to the toolsdb side?
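One way to answer that question is to accumulate wall-clock time per phase and compare the totals at the end of the run. The following is only a sketch: the `timed` helper and the phase labels are made up for illustration, and the `time.sleep` calls stand in for the real read and write steps.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, totals):
    """Add the wall-clock time spent inside the block to totals[label]."""
    start = time.monotonic()
    try:
        yield
    finally:
        totals[label] = totals.get(label, 0.0) + (time.monotonic() - start)


if __name__ == "__main__":
    totals = {}
    with timed("read enwiki replica", totals):
        time.sleep(0.02)  # placeholder for the SELECTs against the replica
    with timed("write toolsdb", totals):
        time.sleep(0.01)  # placeholder for the INSERT/UPDATE statements
    for label, seconds in sorted(totals.items()):
        print(f"{label}: {seconds:.2f}s")
```

Wrapping every read and every write in the same two labels gives one total per side, which should show which side accounts for the extra 90 minutes.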
This sounds like something that is worth opening a Phabricator task about. We do have an existing ticket (https://phabricator.wikimedia.org/T180380) that may also be related, depending on where the disconnects are happening.
Bryan
> All of the errors occurred on writes to the user database.
That is strange: "enwiki.analytics.db.svc.eqiad.wmflabs" is "new" in that it is served by a new set of servers, was upgraded recently, and is still being tuned, whereas toolsdb has not been touched, I think, for a couple of weeks, since it was upgraded, and at this time it is not handled by a proxy.
Do you use connection pooling/persistent connections? That is not allowed [2], but more importantly, it may create connection problems if a server fails over automatically, because the connection will keep pointing to the wrong server.
There was no overload on toolsdb last week that could explain the extra write load: [0]. There was, however, one overload on labsdb1009 (analytics) during the weekend, which led to me banning/throttling and notifying several users, as they had created a denial of service: [1]
Note that one big change on the new servers (analytics and web) is that right now there is no query limitation: if a user runs 10 long-running queries, they can, and that could affect other users. I have not limited that except for per-user issues. If the community wants to agree on and set up some limits, I can do that with no problem, but now that we are not so resource-bound I did not want to introduce artificial limitations, as some people didn't like the limits that the old servers had because of their lower resources.
[0] https://grafana.wikimedia.org/dashboard/file/server-board.json?refresh=1m&am...
[1] https://phabricator.wikimedia.org/T182997
[2] https://wikitech.wikimedia.org/wiki/Help:Toolforge/Database#Connection_handl...
> This sounds like something that is worth of opening a Phabricator task about. We do have an existing ticket (https://phabricator.wikimedia.org/T180380) that may also be somehow related depending on where the disconnects are happening.
Please share the connection details (user, code, timestamps) in a Phabricator task; maybe there is a slowdown on toolsdb that we have not yet noticed. That way we can take a deeper look.