Periodic lag problem - Wikitech-l

9 Oct 2003


      I wrote a little perl script to monitor the health of Wikipedia by 
regularly checking the file size of the main log. The idea was that it 
would email me if apache crashed. I soon discovered something rather 
interesting and unexpected.
Once every 15 minutes or so, Wikipedia shuts down for a minute or two. 
This is bad, needless to say. During these episodes, it shows few if any 
log entries. Load is depressed slightly (from ~20 to ~10), so CPU 
starvation didn't seem likely. I suspected database troubles.
Attached is a zip file containing the output from SHOW PROCESSLIST. 
There is one file showing "typical" output, between the episodes, and 
three showing output in the middle of an episode. One thing stands out 
like a sore thumb:
SELECT 1 FROM user_newtalk WHERE user_ip='167.206.112.85'
SELECT 1 FROM user_newtalk WHERE user_ip='217.244.15.107'
SELECT 1 FROM user_newtalk WHERE user_ip='80.55.166.58'
SELECT 1 FROM user_newtalk WHERE user_ip='66.196.90.11'
SELECT 1 FROM user_newtalk WHERE user_ip='172.176.254.188'
SELECT 1 FROM user_newtalk WHERE user_ip='220.54.212.24'
SELECT 1 FROM user_newtalk WHERE user_id=7580
SELECT 1 FROM user_newtalk WHERE user_ip='194.78.48.226'
SELECT 1 FROM user_newtalk WHERE user_ip='80.225.14.29'
SELECT 1 FROM user_newtalk WHERE user_ip='144.32.128.73'
SELECT 1 FROM user_newtalk WHERE user_ip='62.216.15.127'
SELECT 1 FROM user_newtalk WHERE user_ip='134.151.225.179'
SELECT 1 FROM user_newtalk WHERE user_ip='213.81.145.123'
SELECT 1 FROM user_newtalk WHERE user_ip='129.137.208.159'
SELECT 1 FROM user_newtalk WHERE user_ip='195.93.72.17'
SELECT 1 FROM user_newtalk WHERE user_ip='218.19.141.2'
SELECT 1 FROM user_newtalk WHERE user_ip='68.11.187.242'
SELECT 1 FROM user_newtalk WHERE user_ip='63.34.208.93'
SELECT 1 FROM user_newtalk WHERE user_ip='81.196.21.16'
SELECT 1 FROM user_newtalk WHERE user_ip='202.156.2.82'
Millions of user_newtalk requests, conspicuously absent from the typical 
dump. Many of them are doing "statistics", whatever that means, and the 
rest are locked. One side of me is curious as to what "statistics" means 
and why it produces this behaviour. The other side of me says
KILL KILL KILL
I DON'T CARE WHY OR HOW IT'S HAPPENING JUST KILL IT NOW!!!!!!
ARRRRGGHH (axe hitting hard drive noises)
Anyone up for some memcached programming?
Also attached is some sample output from the monitor program. Each entry 
represents 5 seconds, and is expressed in units of the "danger 
threshold" of 100 bytes per second.
BTW this appears to be unrelated to whatever the problem is with it as I 
type, i.e. a long-term (~1hr) slowdown with load at 0.25 and log file 
turnover at ~20% of normal.
-- Tim Starling.