I wrote a little perl script to monitor the health of Wikipedia by regularly checking the file size of the main log. The idea was that it would email me if apache crashed. I soon discovered something rather interesting and unexpected.
Once every 15 minutes or so, Wikipedia shuts down for a minute or two. This is bad, needless to say. During these episodes, it shows few if any log entries. Load is depressed slightly (from ~20 to ~10), so CPU starvation didn't seem likely. I suspected database troubles.
Attached is a zip file containing the output from SHOW PROCESSLIST. There is one file showing "typical" output, between the episodes, and three showing output in the middle of an episode. One thing stands out like a sore thumb:
SELECT 1 FROM user_newtalk WHERE user_ip='167.206.112.85' SELECT 1 FROM user_newtalk WHERE user_ip='217.244.15.107' SELECT 1 FROM user_newtalk WHERE user_ip='80.55.166.58' SELECT 1 FROM user_newtalk WHERE user_ip='66.196.90.11' SELECT 1 FROM user_newtalk WHERE user_ip='172.176.254.188' SELECT 1 FROM user_newtalk WHERE user_ip='220.54.212.24' SELECT 1 FROM user_newtalk WHERE user_id=7580 SELECT 1 FROM user_newtalk WHERE user_ip='194.78.48.226' SELECT 1 FROM user_newtalk WHERE user_ip='80.225.14.29' SELECT 1 FROM user_newtalk WHERE user_ip='144.32.128.73' SELECT 1 FROM user_newtalk WHERE user_ip='62.216.15.127' SELECT 1 FROM user_newtalk WHERE user_ip='134.151.225.179' SELECT 1 FROM user_newtalk WHERE user_ip='213.81.145.123' SELECT 1 FROM user_newtalk WHERE user_ip='129.137.208.159' SELECT 1 FROM user_newtalk WHERE user_ip='195.93.72.17' SELECT 1 FROM user_newtalk WHERE user_ip='218.19.141.2' SELECT 1 FROM user_newtalk WHERE user_ip='68.11.187.242' SELECT 1 FROM user_newtalk WHERE user_ip='63.34.208.93' SELECT 1 FROM user_newtalk WHERE user_ip='81.196.21.16' SELECT 1 FROM user_newtalk WHERE user_ip='202.156.2.82'
Millions of user_newtalk requests, conspicuously absent from the typical dump. Many of them are doing "statistics", whatever that means, and the rest are locked. One side of me is curious as to what "statistics" means and why it produces this behaviour. The other side of me says
KILL KILL KILL
I DON'T CARE WHY OR HOW IT'S HAPPENING JUST KILL IT NOW!!!!!!
ARRRRGGHH (axe hitting hard drive noises)
Anyone up for some memcached programming?
Also attached is some sample output from the monitor program. Each entry represents 5 seconds, and is expressed in units of the "danger threshold" of 100 bytes per second.
BTW this appears to be unrelated to whatever the problem is with it as I type, i.e. a long-term (~1hr) slowdown with load at 0.25 and log file turnover at ~20% of normal.
-- Tim Starling.
BTW this appears to be unrelated to whatever the problem is with it as I type, i.e. a long-term (~1hr) slowdown with load at 0.25 and log file turnover at ~20% of normal.
Sorry, make that load 35. 0.25 was the load on my desktop PC *smacks head* . I never had that problem when I was using Windows :) All these terminal windows look the same.
-- Tim Starling.
Tim Starling wrote:
Anyone up for some memcached programming?
Talking to myself again... I've committed some completely untested fixes for this problem to CVS, in User.php and UserTalkUpdate.php. The idea is to basically not use the database at all for anonymous users. UserTalkUpdate sets the key in memcached, and if memcached deletes it because it runs out of memory, or the server restarts, tough. User talk messages to anonymous users are only good for a few hours anyway.
I would have tested it but it's getting kind of late here.
-- Tim Starling.
On Fri, Oct 10, 2003 at 12:23:18AM +1000, Tim Starling wrote:
I wrote a little perl script to monitor the health of Wikipedia by regularly checking the file size of the main log. The idea was that it would email me if apache crashed. I soon discovered something rather interesting and unexpected.
Once every 15 minutes or so, Wikipedia shuts down for a minute or two. This is bad, needless to say. During these episodes, it shows few if any log entries. Load is depressed slightly (from ~20 to ~10), so CPU starvation didn't seem likely. I suspected database troubles.
Attached is a zip file containing the output from SHOW PROCESSLIST. There is one file showing "typical" output, between the episodes, and three showing output in the middle of an episode. One thing stands out like a sore thumb:
SELECT 1 FROM user_newtalk WHERE user_ip='167.206.112.85' SELECT 1 FROM user_newtalk WHERE user_ip='217.244.15.107' SELECT 1 FROM user_newtalk WHERE user_ip='80.55.166.58' SELECT 1 FROM user_newtalk WHERE user_ip='66.196.90.11' SELECT 1 FROM user_newtalk WHERE user_ip='172.176.254.188' SELECT 1 FROM user_newtalk WHERE user_ip='220.54.212.24' SELECT 1 FROM user_newtalk WHERE user_id=7580 SELECT 1 FROM user_newtalk WHERE user_ip='194.78.48.226' SELECT 1 FROM user_newtalk WHERE user_ip='80.225.14.29' SELECT 1 FROM user_newtalk WHERE user_ip='144.32.128.73' SELECT 1 FROM user_newtalk WHERE user_ip='62.216.15.127' SELECT 1 FROM user_newtalk WHERE user_ip='134.151.225.179' SELECT 1 FROM user_newtalk WHERE user_ip='213.81.145.123' SELECT 1 FROM user_newtalk WHERE user_ip='129.137.208.159' SELECT 1 FROM user_newtalk WHERE user_ip='195.93.72.17' SELECT 1 FROM user_newtalk WHERE user_ip='218.19.141.2' SELECT 1 FROM user_newtalk WHERE user_ip='68.11.187.242' SELECT 1 FROM user_newtalk WHERE user_ip='63.34.208.93' SELECT 1 FROM user_newtalk WHERE user_ip='81.196.21.16' SELECT 1 FROM user_newtalk WHERE user_ip='202.156.2.82'
Millions of user_newtalk requests, conspicuously absent from the typical dump. Many of them are doing "statistics", whatever that means, and the rest are locked. One side of me is curious as to what "statistics" means and why it produces this behaviour. The other side of me says
Could these hanging selects be just a part of the phenomenon? I mean, assume, the mysql has some reason not to respond, but still accepts new requests. Then all anonymous users reading a page will add a new process looking for a newtalk flag. They don't cause the slowdown, they are its victims.
So the question is: Why does it slow down every 15 minutes? Do we have some cron jobs running every quarter of an hour?
Regards,
JeLuf
Millions of user_newtalk requests, conspicuously absent from the typical dump. Many of them are doing "statistics", whatever that means, and the rest are locked. One side of me is curious as to what "statistics" means and why it produces this behaviour. The other side of me says
Could these hanging selects be just a part of the phenomenon? I mean, assume, the mysql has some reason not to respond, but still accepts new requests. Then all anonymous users reading a page will add a new process looking for a newtalk flag. They don't cause the slowdown, they are its victims.
So the question is: Why does it slow down every 15 minutes? Do we have some cron jobs running every quarter of an hour?
I think you might be right that it's a symptom not a cause. I tried disabling user_newtalk requests for anonymous users, but the problem still occurred. But instead of large numbers of user_newtalk requests blocking, it was article text requests. Common to both was that the blocked queries were all in the "statistics" state.
Is it a cron job? Well if so the period should be perfectly regular. In the captured output previously attached, there are 3 events, with the separations between the events being 23'20" and 19'25". Another file I have here shows a period of 22'05" between two events. Performance tends to slow down before it actually stops, so I'd put error bars on those times of about 20 seconds.
Currently I'm blaming the statistics state. Conventional wisdom is that a long statistics state is caused by a complex join with no specified index.
-- Tim Starling.
wikitech-l@lists.wikimedia.org