William Allen Simpson wrote:
Tim Starling wrote:
William Allen Simpson wrote:
Wouldn't want to bother anybody during an
outage, as I'm sure that folks
are busy. The point of a postmortem is to figure out how to prevent the
same from happening in the future.
Since I still don't have a clue what you're talking about, preventing it
from happening in the future might be difficult. I'll ask again: what is
"local time"? Which local time are you talking about?
Since the thread hasn't had the words "local time" for some time, it
was
hard to figure out your query. Going back to my first message, there is
a "local time" in parentheses. In context, it is clear that "last night
(local time)" relates to your data center. That is, local night.
Meet the sysadmin team (with corresponding timezones)
Brion: PDT -7
Kyle: EDT -4
Kate: BST +1
JeLuF: CEDT +2
Mark: CEDT +2
Domas: usually EET +2, currently travelling somewhere
Me: AEST +10
The only person who lives near the servers is Kyle, and he was hired as a
hardware tech for that exact reason. The servers all have their clocks set
to UTC, and we use UTC in logs and communications.
Had you looked at the graphs (or other logs) at the
time, or anytime in
the following day (all I would hope), the data was obvious. Since you
didn't, I've attached some of the dozen or so that I saved.
The time was 02:30+ UTC.
Ah, well in that case, I can tell you exactly what happened. The hero of the
day was Zsinj, a canny newbie who had his eye on the relevant monitoring
graphs, and alerted us to the problem immediately, using very specific
terms, allowing us to track down and fix it rapidly.
Log extract from #wikimedia-tech follows, times are UTC+10.
[12:32] <Zsinj> Just had a SQL spike and squids almost flatlined. What happened?
[12:34] <TimStarling> Zsinj: had?
[12:34] <Interiot> network is slow maybe?
[12:34] <TimStarling> oh, continuing
[12:34] <Zsinj> SQL loads adb2, db4, and ariel all skyrocket load averages
[12:35] <Zsinj> db2, db4, and ariel*
[12:39] <Zsinj> hm, something caused sq1 to really cut its load, that's what
increased the demand on the SQL servers.
[12:39] <DragonFire2410> Wikinews not responding
[12:39] * mboverload stabs himself because someone thought it was a good
idea to put the datacenter in Flordia
[12:39] <Zsinj> what's wrong with florida?
[12:40] <Zsinj> TimStarling: Perhaps something with sq1?
At about this time I identified the problem as being a flood of
Special:Listusers type queries. Fearing a deliberate DoS attack, I invited
Zsinj into the private channel, so that he could watch while we fixed the
problem that he described so well and so promptly. My immediate response was
to disable the special page and kill the queries. While I was doing that, I
had time to type a few lines into IRC. Kate (consanguinity) was also there.
[12:44] * Zsinj (n=chatzill(a)node230-67.unnamed.db.erau.edu) has joined
#secretchannel
[12:44] <TimStarling> invited Zsinj
[12:44] <Zsinj> Hello everyone.
[12:45] <TimStarling> I'm a bit busy at the moment
[12:46] <TimStarling> basically we had a flood of requests for
Special:Listusers with a limit of 500
[12:46] <Zsinj> Interesting.
[12:46] <TimStarling> it could have been a deliberate DoS attack
[12:46] <Zsinj> it certainly has the symptoms of one.
[12:46] <consanguinity> was it the same page? more likely someone trying to
list all the users, if not
[12:47] <Zsinj> or a rogue bot request?
[12:47] <consanguinity> assume stupidity before malice ;-)
[12:47] <TimStarling> well yeah, these things have almost always turned out
to be accidents
[12:47] <TimStarling> but I thought I'd better switch channel just in case
[12:48] <Zsinj> Good precaution. :)
[12:48] <TimStarling> I disabled that special page altogether
[12:48] <TimStarling> then I killed the queries, still in progress
[12:48] <Zsinj> squids and apaches are on the upswing and SQL load is
decreasing.
[12:49] <consanguinity> is listusers still using limit,offset or was it slow
for another reason?
Some pasting of queries and thinking out loud followed. I worked out the
problem and applied a temporary fix at 13:03 UTC+10. I summarised it in a
later conversation with Domas on #wikimedia-tech, who among other things is
our local DB expert:
[00:04] <Wegge> So it's around 5 or 6 AM for you?
[00:04] <dammit> 7AM
[00:11] * dammit is now known as domas
[00:17] <TimStarling> hi domas
[00:17] <domas> hey Tim
[00:17] <TimStarling> we had some stuff going on earlier today with GROUP BY
that you might be interested in
[00:17] <TimStarling> well, not today in your timezone obviously
[00:18] <TimStarling> mysql> SELECT 'Listusers' as type, 2 AS namespace,
user_name AS title, user_name as value, user_id, COUNT(ug_group) as
numgroups FROM `user` LEFT JOIN `user_groups` ON user_id=ug_user GROUP BY
user_name, user_id ORDER BY value LIMIT 50,50;
[00:18] <TimStarling> 50 rows in set (32.94 sec)
[00:18] <TimStarling> mysql> SELECT 'Listusers' as type, 2 AS namespace,
user_name AS title, user_name as value, user_id, COUNT(ug_group) as
numgroups FROM `user` LEFT JOIN `user_groups` ON user_id=ug_user GROUP BY
user_name ORDER BY value LIMIT 50,50;
[00:18] <TimStarling> 50 rows in set (0.01 sec)
[00:18] <TimStarling> the first one groups by two columns
[00:19] <TimStarling> it was done in the name of PostgreSQL compatibility:
http://mail.wikipedia.org/pipermail/mediawiki-cvs/2006-April/014586.html
[00:19] <domas> riiiiight
[00:19] <domas> probably it's better to use min/max for constants
[00:19] <domas> rather than putting them to GROUP BY
[00:19] <domas> this was mistake I was doing few years ago too
[00:22] <TimStarling> well, I reverted the extra fields in GROUP BY, there
were a few of them in that patch
[00:25] <TimStarling>
http://mail.wikipedia.org/pipermail/mediawiki-cvs/2006-April/014764.html
-- Tim Starling