On Sun, Sep 21, 2008 at 3:56 AM, Tim Starling <tstarling(a)wikimedia.org> wrote:
> This causes a regression to a situation we saw a few years ago, when,
> as a team, we were still working out how to optimise MySQL locking.
> Updates to the site_stats table are serialized while waiting for
> various other updates and PHP code to complete, limiting the total
> possible edit rate of the wiki and risking telescoping lock contention
> and total site failure. This is exactly what happened on Saturday,
> except that we were luckier than we were in the past, and the ES
> master failed before the core master, so we still had r/o service.
>
> There are good reasons why the transaction brackets are the way they
> are. Please don't change any more of them without discussing the
> issues in depth with Brion or Domas or me.
But the transaction was committed immediately after the site_stats
table was updated. Why was that causing site_stats locks to be held
longer than the alternative?
There must be some way to do a non-locking increment of some kind
that's still rolled back if the main transaction is rolled back. For
instance, perhaps we could insert a row into a special table,
site_stats_updates, within the transaction. That shouldn't lock
anything, at least not for the whole duration of the transaction. Then a single
background process, maybe in a tight loop or with brief delays between
iterations, could start a transaction, select all the data from the
table (not blocking inserts), make the appropriate site_stats updates,
delete all rows from the table that it previously selected (not
touching rows that were added since then), and commit. Since only
this one process would be reading existing rows from the table, there
would be no meaningful locking of any kind.
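To make the idea concrete, here is a minimal sketch of that pattern
using Python's sqlite3 (MediaWiki actually runs PHP on MySQL, and the
site_stats_updates schema and column names below are hypothetical, not
the real tables): each edit appends a delta row inside its own
transaction, and one background consumer periodically sums the deltas,
applies them to site_stats, and deletes only the rows it read.

```python
import sqlite3

# Assumed, simplified schema: a one-row stats table plus a queue of deltas.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE site_stats (ss_total_edits INTEGER NOT NULL);
    INSERT INTO site_stats VALUES (0);
    CREATE TABLE site_stats_updates (
        ssu_id INTEGER PRIMARY KEY AUTOINCREMENT,
        ssu_edits_delta INTEGER NOT NULL
    );
""")

def record_edit(conn):
    """Inside the main edit transaction: append a delta row instead of
    updating (and locking) the single site_stats row. Rolls back with
    the rest of the transaction if the edit fails."""
    conn.execute(
        "INSERT INTO site_stats_updates (ssu_edits_delta) VALUES (1)")

def flush_updates(conn):
    """Background consumer: aggregate the queued deltas, apply them to
    site_stats, then delete only the rows it actually read, leaving
    rows inserted in the meantime untouched."""
    rows = conn.execute(
        "SELECT ssu_id, ssu_edits_delta FROM site_stats_updates"
    ).fetchall()
    if not rows:
        return
    total = sum(delta for _, delta in rows)
    max_id = max(ssu_id for ssu_id, _ in rows)
    conn.execute(
        "UPDATE site_stats SET ss_total_edits = ss_total_edits + ?",
        (total,))
    conn.execute(
        "DELETE FROM site_stats_updates WHERE ssu_id <= ?", (max_id,))
    conn.commit()

# Simulate three edits, then one pass of the background process.
for _ in range(3):
    record_edit(conn)
conn.commit()
flush_updates(conn)
print(conn.execute("SELECT ss_total_edits FROM site_stats").fetchone()[0])
```

Since only the one consumer ever reads and deletes queue rows, the
writers never contend with each other on a shared counter row; the
deletion keyed on the maximum id seen is what keeps concurrent inserts
safe from being dropped.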
Of course, this might be a bit extreme just to keep the stupid
site_stats table updated (especially needing a separate process to do
the updates), but the unneeded inaccuracy annoys me. Is the
category table count inaccurate for similar reasons?