William Allen Simpson wrote:
Tim Starling wrote:
William Allen Simpson wrote:
Wouldn't want to bother anybody during an outage, as I'm sure that folks are busy. The point of a postmortem is to figure out how to prevent the same from happening in the future.
Since I still don't have a clue what you're talking about, preventing it from happening in the future might be difficult. I'll ask again: what is "local time"? Which local time are you talking about?
Since the thread hasn't had the words "local time" for some time, it was hard to figure out your query. Going back to my first message, there is a "local time" in parentheses. In context, it is clear that "last night (local time)" relates to your data center. That is, local night.
Meet the sysadmin team (with corresponding timezones)
Brion: PDT -7 Kyle: EDT -4 Kate: BST +1 JeLuF: CEDT +2 Mark: CEDT +2 Domas: usually EET +2, currently travelling somewhere Me: AEST +10
The only person who lives near the servers is Kyle, and he was hired as a hardware tech for that exact reason. The servers all have their clocks set to UTC, and we use UTC in logs and communications.
Had you looked at the graphs (or other logs) at the time, or anytime in the following day (all I would hope), the data was obvious. Since you didn't, I've attached some of the dozen or so that I saved.
The time was 02:30+ UTC.
Ah, well in that case, I can tell you exactly what happened. The hero of the day was Zsinj, a canny newbie who had his eye on the relevant monitoring graphs, and alerted us to the problem immediately, using very specific terms, allowing us to track down and fix it rapidly.
Log extract from #wikimedia-tech follows, times are UTC+10.
[12:32] <Zsinj> Just had a SQL spike and squids almost flatlined. What happened? [12:34] <TimStarling> Zsinj: had? [12:34] <Interiot> network is slow maybe? [12:34] <TimStarling> oh, continuing [12:34] <Zsinj> SQL loads adb2, db4, and ariel all skyrocket load averages [12:35] <Zsinj> db2, db4, and ariel* [12:39] <Zsinj> hm, something caused sq1 to really cut its load, that's what increased the demand on the SQL servers. [12:39] <DragonFire2410> Wikinews not responding [12:39] * mboverload stabs himself because someone thought it was a good idea to put the datacenter in Flordia [12:39] <Zsinj> what's wrong with florida? [12:40] <Zsinj> TimStarling: Perhaps something with sq1?
At about this time I identified the problem as being a flood of Special:Listusers type queries. Fearing a deliberate DoS attack, I invited Zsinj into the private channel, so that he could watch while we fixed the problem that he described so well and so promptly. My immediate response was to disable the special page and kill the queries. While I was doing that, I had time to type a few lines into IRC. Kate (consanguinity) was also there.
[12:44] * Zsinj (n=chatzill@node230-67.unnamed.db.erau.edu) has joined #secretchannel [12:44] <TimStarling> invited Zsinj [12:44] <Zsinj> Hello everyone. [12:45] <TimStarling> I'm a bit busy at the moment [12:46] <TimStarling> basically we had a flood of requests for Special:Listusers with a limit of 500 [12:46] <Zsinj> Interesting. [12:46] <TimStarling> it could have been a deliberate DoS attack [12:46] <Zsinj> it certainly has the symptoms of one. [12:46] <consanguinity> was it the same page? more likely someone trying to list all the users, if not [12:47] <Zsinj> or a rogue bot request? [12:47] <consanguinity> assume stupidity before malice ;-) [12:47] <TimStarling> well yeah, these things have almost always turned out to be accidents [12:47] <TimStarling> but I thought I'd better switch channel just in case [12:48] <Zsinj> Good precaution. :) [12:48] <TimStarling> I disabled that special page altogether [12:48] <TimStarling> then I killed the queries, still in progress [12:48] <Zsinj> squids and apaches are on the upswing and SQL load is decreasing. [12:49] <consanguinity> is listusers still using limit,offset or was it slow for another reason?
Some pasting of queries and thinking out loud followed. I worked out the problem and applied a temporary fix at 13:03 UTC+10. I summarised it in a later conversation with Domas on #wikimedia-tech, who among other things is our local DB expert:
[00:04] <Wegge> So it's around 5 or 6 AM for you? [00:04] <dammit> 7AM [00:11] * dammit is now known as domas [00:17] <TimStarling> hi domas [00:17] <domas> hey Tim [00:17] <TimStarling> we had some stuff going on earlier today with GROUP BY that you might be interested in [00:17] <TimStarling> well, not today in your timezone obviously [00:18] <TimStarling> mysql> SELECT 'Listusers' as type, 2 AS namespace, user_name AS title, user_name as value, user_id, COUNT(ug_group) as numgroups FROM `user` LEFT JOIN `user_groups` ON user_id=ug_user GROUP BY user_name, user_id ORDER BY value LIMIT 50,50; [00:18] <TimStarling> 50 rows in set (32.94 sec) [00:18] <TimStarling> mysql> SELECT 'Listusers' as type, 2 AS namespace, user_name AS title, user_name as value, user_id, COUNT(ug_group) as numgroups FROM `user` LEFT JOIN `user_groups` ON user_id=ug_user GROUP BY user_name ORDER BY value LIMIT 50,50; [00:18] <TimStarling> 50 rows in set (0.01 sec) [00:18] <TimStarling> the first one groups by two columns [00:19] <TimStarling> it was done in the name of PostgreSQL compatibility: http://mail.wikipedia.org/pipermail/mediawiki-cvs/2006-April/014586.html [00:19] <domas> riiiiight [00:19] <domas> probably it's better to use min/max for constants [00:19] <domas> rather than putting them to GROUP BY [00:19] <domas> this was mistake I was doing few years ago too [00:22] <TimStarling> well, I reverted the extra fields in GROUP BY, there were a few of them in that patch [00:25] <TimStarling> http://mail.wikipedia.org/pipermail/mediawiki-cvs/2006-April/014764.html
-- Tim Starling