rationalwiki.org is currently serving pages very slowly. It's intermittent, but when it's slow it's a *slug*. Many users are getting 502 errors from Apache or 503 from the Squids.
We have one Linode doing Apache/MySQL/Lucene. It's an 8GB box with 8 cores. (Was 4GB/4 cores, but Linode just doubled everyone's server.) In front of that are two Squids fed by a load balancer.
* Sometimes the cause is obvious: when the load average is 30 and top shows a pile of Apaches using up CPU, then it's PHP handling a complex page request. (No, I still haven't made it PHP via fcgid.)
* Sometimes it isn't, e.g. this afternoon when the site was running like a slug and the load average was 0.8 with nothing amiss in top.
* The squids don't show an unusual rate of hits on the site.
* We have plenty of memory free - about 4GB on the main box is just sitting in file cache.
* php_errors.log only shows some processes timing out their 30 seconds (which would be the 502s).
So where would I start looking to work out what's going on?
- d.
Some of the things I look at when things are slow, in no particular order:
* Check top. Look at memory usage (ensure no swap usage), CPU usage, and deadlocked processes.
* Check disk space. Ensure all drives have enough free space.
* Check error logs.
* See which server or component is being slow. Is it just one server? Is it static or dynamic pages?
* Check the MySQL process list (SHOW PROCESSLIST). Generally this is no more than a couple of items; any more can indicate an issue.
* Check hit rates in all caches. Is the cache filling up too quickly, resulting in a low hit rate and a high refresh rate?
* If there are intermittent issues that are hard to track, manually benchmark services (e.g. with ApacheBench) to try to spot the issue and its cause.
* See if there's any pattern to when the slowdowns occur. Do they coincide with any cron scripts (e.g. Lucene index updates)? (A few of these checks are sketched as concrete commands below.)
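A minimal sketch of a few of those checks from the shell. The log path and MySQL credentials are assumptions that vary per install, and the benchmark URL is just an example page:

  # Memory/CPU snapshot; in interactive top, also watch swap usage and process states.
  top -b -n 1 | head -20

  # Free space on all mounted filesystems.
  df -h

  # Recent Apache/PHP errors (path varies by distribution).
  tail -n 50 /var/log/apache2/error.log

  # Current MySQL activity; more than a handful of entries, or queries stuck
  # in "Locked" or "Copying to tmp table", points at the database.
  mysql -u root -p -e 'SHOW FULL PROCESSLIST'

  # Quick ApacheBench run: 100 requests, 10 concurrent, against one page.
  ab -n 100 -c 10 http://rationalwiki.org/wiki/Main_Page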
Note that most of the above items should be applied to each individual server to narrow down the source. Since you appear to have three servers handling content (two squids and one backend), try accessing the wiki directly on each one to see if that narrows down the issue.
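For example, comparing a request through the front end with one straight at Apache shows which layer is slow (the backend IP below is a placeholder; substitute the real address):

  # Through the load balancer and squids.
  time curl -s -o /dev/null http://rationalwiki.org/wiki/Main_Page

  # Straight to the Apache backend, with the right Host header
  # (203.0.113.10 is a placeholder for the backend's address).
  time curl -s -o /dev/null -H 'Host: rationalwiki.org' http://203.0.113.10/wiki/Main_Page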
+1 for this discussion. While developing on a single 4-core/8GB EC2 instance with no HTTP cache, no Lucene, just Apache and MySQL, I notice that sometimes, even when I'm the only user, the response to a soft browser refresh (not Ctrl-F5) is very sluggish; I'm not sure whether it's Amazon's fault. Other times it's quite responsive.
On 21/04/13 05:29, David Gerard wrote:
So where would I start looking to work out what's going on?
If there is any kind of site issue at WMF, I usually start with Ganglia. It takes some practice to read it correctly, but it gives you information far more quickly than just about anything else. My notes on WMF incident response give some hints on how to use it, as well as discussing some other tools:
https://wikitech.wikimedia.org/wiki/Incident_response
If the problem seems to be downstream of MediaWiki, then profiling is usually the next thing to look at. Wikipedia has been using DIY profiling to diagnose site performance issues since it was on a single server.
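Before setting up server-side profiling, a quick external first cut is to ask curl where the time goes; a big gap between the connect time and the time to first byte points at the backend rather than the network (the URL is just an example):

  curl -s -o /dev/null \
    -w 'dns:%{time_namelookup} connect:%{time_connect} first-byte:%{time_starttransfer} total:%{time_total}\n' \
    http://rationalwiki.org/wiki/Main_Page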
- Sometimes it isn't, e.g. this afternoon when the site was running like a slug and the load average was 0.8 with nothing amiss in top.
Processes in the "S" state do not contribute to the load average, whether or not users are waiting for them. For example, PHP may be waiting for Lucene. Try the section in the incident response notes under "slow backend service".
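A minimal way to spot that from the shell, using standard procps and strace (the PID is whichever slow worker you pick from the list):

  # Show each worker's state and the kernel wait channel it is sleeping in;
  # many workers parked in socket waits points at a slow backend service.
  ps -eo pid,stat,wchan:30,etime,comm | grep -E 'apache|httpd|php'

  # Attach to one stuck worker and see which syscall it is blocked on,
  # e.g. a read() from a socket connected to Lucene or MySQL.
  strace -p <PID>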
-- Tim Starling
First, have there been any configuration changes shortly before the problem began? The first rule is "look for stupidity", as in an error in configuration causing a self-DoS. Many of us have done that to ourselves, to our embarrassment. If not, go with Tim's suggestion and also look at squid's logs. Are you getting requests, but no full sessions (SYN flood)?
I'm on your site periodically. It has run smoothly since you went with Linode, and the site is overall well behaved. However, it's one that could easily become the target of a script kiddie. So, do you have SYN cookies turned on?
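Both are quick to check from the shell (standard iproute2 and sysctl commands; no site-specific assumptions beyond a Linux kernel):

  # Count half-open connections; a large, growing number suggests a SYN flood.
  ss -tan state syn-recv | wc -l

  # Is the kernel's SYN-cookie defence enabled? (1 = yes)
  sysctl net.ipv4.tcp_syncookies

  # Turn it on at runtime if not (persist it via /etc/sysctl.conf).
  sysctl -w net.ipv4.tcp_syncookies=1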
I'm a sysadmin/netadmin, but my view is a bit colored by my information security experience. Hence, I always have to remind myself that stupidity is the most frequent cause of a problem, and malicious intent the last thing to suspect.
A large number of httpd daemons can mean PHP hits, or SYN flooding in a non-squid environment or even via a creatively crafted attack. The latter is beyond rare for anything that isn't extremely high-profile (think Fortune 500 and government scale). The most common cause is a burst of intra-cranial flatulence or a case of fat fingers. So, look again at the logs and processes during the slug convention, from Tim's suggested perspective. If you can't find anything there, look closer at squid and at connection-based events. When I worked for the US DoD, our most common DoS was self-inflicted, in an environment where DDoS, general DoS and every other form of attack were incessantly being attempted. Two were inflicted by my own humble fat fingers. :/
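One rough way to get that connection-level picture from squid (assuming the native access log format and a typical log path, both of which vary by install):

  # Break down recent requests by squid result and HTTP status code;
  # a burst of aborted or error results stands out immediately.
  tail -n 10000 /var/log/squid/access.log | awk '{print $4}' | sort | uniq -c | sort -rn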
On 22/04/13 15:34, Stephen Villano wrote:
Are you getting requests, but no full sessions (SYN flood)? [...] So, do you have SYN cookies turned on?
Most kinds of DoS attack, including SYN flooding, can be seen in Ganglia as a sharp increase in inbound network traffic, especially as measured by packet count (pkts_in).
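If you don't run Ganglia, a rough command-line equivalent (assuming the sysstat package is installed) is to watch packet rates directly:

  # Per-interface packet and byte rates, sampled once a second, five times;
  # a sustained spike in rxpck/s with little change in rxkB/s is consistent
  # with a SYN flood, since SYN packets are tiny.
  sar -n DEV 1 5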
SYN cookies are definitely a good idea, regardless of whether an attack is underway. They are enabled by default in Ubuntu.
-- Tim Starling
Thanks, Tim. Can you point to an example of a SYN attack as displayed in Ganglia? I looked at the Wikimedia Ganglia displays, but no example was apparent. It would help most people if an example of the attack pattern were on display.
No opinion on this particular case, but more information makes for better decision-making. :D