[Labs-l] Replication broken

Tue Dec 10 14:14:29 UTC 2013

Marc-André Pelletier <mpelletier at wikimedia.org> wrote:

>> The same thing happened yesterday; Coren determined that a certain project
>> that shall remain nameless, unless you consider "catscan2" to be a name, had
>> been holding a lock on the database for over 12 hours.

> And indeed that's the case again; I need to sit down with the catscan
> people and find a better way for them to do whatever it is they are
> trying to do in a way that does not result in:

> ---TRANSACTION 9337D917, ACTIVE 52358 sec fetching rows
> mysql tables in use 3, locked 3
> 566321 lock struct(s), heap size 56850872, 10281493 row lock(s), undo
> log entries 150718

> I really want to avoid having to install a query killer and place a hard
> time limit on running time - no matter where the line is, it will be an
> annoyance to /someone/.  There are occasional legitimate uses for very
> long queries; but they should be infrequent and not last *that* long.

IIRC it's not necessarily long queries that lock down the
DB, but queries that request locks :-).  I see temporary
tables and inserts in the catscan2 code, and I think that
might be the culprit.

That was one of the many surprises when moving from
Toolserver to Labs: On Toolserver, a tool could request a
lock for a minute (or whatever the threshold for the query
killer was), the replication lag would go up, but the query
that actually got killed was the read-only one that had been
running peacefully for hours and had not increased
replication lag in any way.

So, *please*, if we think about enabling some sort of query
killer on Tools, let's make *very* sure that it aims
accurately.
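
To illustrate what "aiming accurately" could mean, here is a
minimal Python sketch of the selection logic only.  It is not
an existing tool.  The Trx class, the pick_victims function and
the 60-second threshold are made up for this example; the
fields loosely mirror trx_mysql_thread_id, trx_rows_locked and
trx_rows_modified from InnoDB's information_schema.INNODB_TRX
table.  The idea is to pick victims by the locks they hold, not
by how long they have been running.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Trx:
    """Hypothetical snapshot of one InnoDB transaction."""
    thread_id: int       # connection that owns the transaction
    age_seconds: int     # how long the transaction has been active
    rows_locked: int     # row locks currently held
    rows_modified: int   # pending (uncommitted) row changes


def pick_victims(trxs: List[Trx], max_lock_age: int = 60) -> List[int]:
    """Return thread ids of transactions that have held locks or
    pending writes for longer than max_lock_age seconds.

    A transaction with no row locks and no modified rows is never
    selected, no matter how old it is.
    """
    return [
        t.thread_id
        for t in trxs
        if (t.rows_locked > 0 or t.rows_modified > 0)
        and t.age_seconds > max_lock_age
    ]


snapshot = [
    # Figures taken from the transaction quoted above.
    Trx(thread_id=1, age_seconds=52358, rows_locked=10281493, rows_modified=150718),
    # Long-running but purely read-only: must be left alone.
    Trx(thread_id=2, age_seconds=7200, rows_locked=0, rows_modified=0),
    # Holds locks, but only briefly: also left alone.
    Trx(thread_id=3, age_seconds=5, rows_locked=40, rows_modified=3),
]
victims = pick_victims(snapshot)
```

With this snapshot only thread 1 is selected; the two-hour
read-only query on thread 2 survives, which is the opposite of
the Toolserver behaviour described above.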

Tim