On Sun, Sep 21, 2008 at 3:56 AM, Tim Starling <tstarling@wikimedia.org> wrote:
> This causes a regression to a situation we saw a few years ago, when, as a team, we were still working out how to optimise MySQL locking. Updates to the site_stats table are serialized while waiting for various other updates and PHP code to complete, limiting the total possible edit rate of the wiki and risking telescoping lock contention and total site failure. This is exactly what happened on Saturday, except that we were luckier than in the past: the ES master failed before the core master, so we still had read-only service.
>
> There are good reasons why the transaction brackets are the way they are. Please don't change any more of them without discussing the issues in depth with Brion, Domas, or me.
But the transaction was committed immediately after the site_stats table was updated. Why would that cause the site_stats locks to be held any longer than the alternative?
There must be some way to do a non-locking increment of some kind, that's still rolled back if the main transaction is rolled back. For instance, perhaps we could insert a row into a special table, site_stats_updates, in the transaction. This shouldn't lock anything, at least not for the whole duration of the transaction. Then a single background process, maybe in a tight loop or with brief delays between iterations, could start a transaction, select all the data from the table (not blocking inserts), make the appropriate site_stats updates, delete all rows from the table that it previously selected (not touching rows that were added since then), and commit. Since only this one process would be reading existing rows from the table, there would be no meaningful locking of any kind.
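The scheme above could be sketched roughly like this. This is a minimal, hypothetical illustration using SQLite so it's self-contained; the table and column names (site_stats_updates, ssu_id, ssu_edits_delta) are invented for the example and are not the actual MediaWiki schema, and real MySQL transaction/isolation behaviour would differ from SQLite's:

```python
# Sketch of the proposed insert-only stats queue: edits append delta rows
# inside their own transaction (rolled back with it if it aborts), and a
# single background process periodically folds the deltas into site_stats.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE site_stats (ss_total_edits INTEGER NOT NULL);
    INSERT INTO site_stats VALUES (0);
    CREATE TABLE site_stats_updates (
        ssu_id INTEGER PRIMARY KEY AUTOINCREMENT,
        ssu_edits_delta INTEGER NOT NULL
    );
""")

def record_edit(conn):
    """Inside the main edit transaction: append a delta row.
    Plain inserts don't contend with each other, and the row
    disappears if the surrounding transaction rolls back."""
    conn.execute(
        "INSERT INTO site_stats_updates (ssu_edits_delta) VALUES (1)")

def flush_stats(conn):
    """Background aggregator: read all queued deltas, apply their sum
    to site_stats, then delete only the rows it actually read, so
    rows inserted after the read survive for the next pass."""
    rows = conn.execute(
        "SELECT ssu_id, ssu_edits_delta FROM site_stats_updates"
    ).fetchall()
    if not rows:
        return 0
    max_id = max(ssu_id for ssu_id, _ in rows)
    total = sum(delta for _, delta in rows)
    conn.execute(
        "UPDATE site_stats SET ss_total_edits = ss_total_edits + ?",
        (total,))
    conn.execute(
        "DELETE FROM site_stats_updates WHERE ssu_id <= ?", (max_id,))
    conn.commit()
    return total

# Three edits land, then one background flush folds them in.
for _ in range(3):
    record_edit(db)
db.commit()
flush_stats(db)
print(db.execute("SELECT ss_total_edits FROM site_stats").fetchone()[0])
```

Deleting by the highest id seen (rather than truncating the table) is what lets the aggregator avoid touching rows added after its read, so concurrent edits never wait on it.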
Of course, this might be a bit extreme just to keep the stupid site_stats table updated (especially since it requires a separate process to do the updates), but the needless inaccuracy annoys me. Is the category table's count inaccurate for similar reasons?