So, rationalwiki.org has been *much* faster and more usable with a coupla squids and a load-balancer in front of the Apache/Lucene/database node. (We could probably cope with just one squid, but Trent wanted to experiment.) The nodes are all Ubuntu 10.04 Linodes, the software manually kept up to date.
Our problem now is that the Apache box sometimes ... just goes nuts: it fills memory with Apache processes, goes into swap, and then the oom-killer comes out to play and we have to work out what it's killed, or (quicker) reboot the node.
We have had occasional load spikes - where the load-balancer sees someone or something hammering it at 300 hits/sec or so - but they *don't* always coincide with Apache going nuts. The squids don't show any excess load while Apache is going nuts, either.
If we happen to catch it when it's in swap but before oom-killer comes out, apache2ctl restart brings things back to normality.
The Apache node has 4GB memory, about 3GB of that being free/cache in normal operation.
We have NO IDEA what or why this is happening. Last happened around three days ago. Since then it's been lovely, but it always is until it falls over. Clues welcomed.
- d.
On Mon, Oct 29, 2012 at 6:14 PM, David Gerard dgerard@gmail.com wrote:
We have NO IDEA what or why this is happening. Last happened around three days ago. Since then it's been lovely, but it always is until it falls over. Clues welcomed.
#1 cause is MaxClients set too high.
As OQ mentioned, if MaxClients in Apache is set too high, then when you get a burst of connections like this you'll eat up all your RAM and begin swapping, which kills performance real quick. If you don't want it to fail in this way, you can tune MaxClients to a point where even a burst won't use up all your RAM, at the expense of dropping/refusing connections.
As a rough guess at what to set MaxClients to, use "top" and look at the difference between the RES and SHR columns for the httpd processes. I believe this is roughly the amount of non-shared memory each of the child Apache processes is using (for example, I'm averaging 10 MB per process). Take the maximum amount of RAM you want Apache to use and divide it by this per-process figure to get a rough number for MaxClients. You can monitor memory usage and adjust this as needed, or tune it through load testing.
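The arithmetic above can be sketched as a small script. This is a rough sketch, not a turnkey tool: it assumes Linux's /proc, Apache children named "apache2" (the Debian/Ubuntu name), and a RAM budget you pick yourself.

```shell
#!/bin/sh
# Suggest a MaxClients value: RAM budget divided by the average
# non-shared memory (resident minus shared) of the Apache children.
suggest_maxclients() {
    budget_mb=$1   # RAM you are willing to give Apache, in MB
    avg_mb=$2      # average non-shared memory per child, in MB
    [ "$avg_mb" -lt 1 ] && avg_mb=1   # avoid dividing by zero
    echo $(( budget_mb / avg_mb ))
}

# Measure the average (resident - shared) of running apache2 children,
# using /proc/<pid>/statm (fields are counted in pages).
page_kb=$(( $(getconf PAGESIZE) / 1024 ))
total_kb=0; count=0
for pid in $(pgrep apache2 2>/dev/null); do
    [ -r "/proc/$pid/statm" ] || continue
    set -- $(cat "/proc/$pid/statm")
    total_kb=$(( total_kb + ($2 - $3) * page_kb ))   # resident - shared
    count=$(( count + 1 ))
done

if [ "$count" -gt 0 ]; then
    avg_mb=$(( total_kb / count / 1024 ))
    echo "avg non-shared per child: ${avg_mb} MB"
    echo "suggested MaxClients for a 2048 MB budget: $(suggest_maxclients 2048 "$avg_mb")"
fi
```

With Dave's 10 MB average and a 2 GB budget, for instance, this works out to a MaxClients of around 204.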
Of course, this addresses the swapping rather than the actual issue you are having. I would continue to look at what is causing the surge in requests. Is it a DoS of some sort (accidental or deliberate), or is some part of the server stalling, causing requests to pile up and overflow? For example, if the database is having an issue (a bunch of long queries), all Apache requests will start piling up until you hit swap or the database issue resolves itself.
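On the DoS question, a quick tally of the access log by client IP usually settles it. A minimal sketch, assuming the client IP is the first field of each log line (as in Apache's common/combined formats); the path in the comment is the Debian/Ubuntu default:

```shell
#!/bin/sh
# Count requests per client IP from an access log on stdin; a single
# IP dominating the top of the list points at one hammering client.
top_talkers() {
    awk '{ print $1 }' | sort | uniq -c | sort -rn | head -10
}

# Typical use against the last chunk of the live log:
#   tail -n 10000 /var/log/apache2/access.log | top_talkers
```

If the top entry is an order of magnitude above the rest during a spike, you have your culprit (or at least a candidate for a rate limit at the squids).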
Along these lines I would suggest some sort of monitoring/logging service like Zabbix or Nagios (to name just two; there are many similar options to choose from). This helps you on two fronts: 1) logging of parameters and 2) service monitoring. Trying to diagnose issues after the fact can be difficult or impossible, and with such a service you have a record of many things to help you pinpoint, or at least narrow down, the root cause of the issue. Monitoring is also invaluable, as you can be emailed/texted when the issue actually happens (or is just starting) rather than 5 minutes after the website has begun timing out for everyone.
On 29 October 2012 19:14, David Gerard dgerard@gmail.com wrote:
So, rationalwiki.org has been *much* faster and more usable with a coupla squids and a load-balancer in front of the Apache/Lucene/database node. (We could probably cope with just one squid, but Trent wanted to experiment.) The nodes are all Ubuntu 10.04 Linodes, the software manually kept up to date.
....
On 29 October 2012 23:56, Dave Humphrey dave@uesp.net wrote:
As a rough guess at what to set MaxClients to, use "top" and look at the difference between the RES and SHR columns for the httpd processes. I believe this is roughly the amount of non-shared memory each of the child Apache processes is using (for example, I'm averaging 10 MB per process). Take the maximum amount of RAM you want Apache to use and divide it by this per-process figure to get a rough number for MaxClients. You can monitor memory usage and adjust this as needed, or tune it through load testing.
I just set MaxClients to 50 (on the basis of fat apache2 processes showing a ~50MB discrepancy between RES and SHR). Let's see what happens.
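For reference, that knob lives in the prefork MPM section of the Apache 2.2 config on Ubuntu 10.04; a sketch with illustrative values, not a recommendation (MaxRequestsPerChild is worth setting too, so that fat or leaky children get recycled):

```apache
# /etc/apache2/apache2.conf (Apache 2.2, prefork MPM)
<IfModule mpm_prefork_module>
    StartServers          5
    MinSpareServers       5
    MaxSpareServers      10
    MaxClients           50
    MaxRequestsPerChild 500
</IfModule>
```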
Of course, this addresses the swapping rather than the actual issue you are having. I would continue to look at what is causing the surge in requests. Is it a DoS of some sort (accidental or deliberate), or is some part of the server stalling, causing requests to pile up and overflow? For example, if the database is having an issue (a bunch of long queries), all Apache requests will start piling up until you hit swap or the database issue resolves itself.
I see that in busy times the CPU usage goes way up, and a chunk of it is MySQL. I could be wrong, but this suggests to me complex requests to MediaWiki (e.g. logged-in editors right-clicking diffs on an obscure page). I should probably profile MediaWiki, given we have a pile of custom extensions.
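Before full MediaWiki profiling, the MySQL slow query log is a cheap first step for confirming the "long queries" theory. A sketch for the MySQL 5.1 that Ubuntu 10.04 ships; the file paths are assumed to be the distro defaults:

```ini
# /etc/mysql/my.cnf — log statements that run longer than 2 seconds
[mysqld]
slow_query_log      = 1
slow_query_log_file = /var/log/mysql/mysql-slow.log
long_query_time     = 2
```

The bundled mysqldumpslow tool can then summarize the log and surface the worst offenders.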
Along these lines I would suggest some sort of monitoring/logging service like Zabbix or Nagios (to name just two; there are many similar options to choose from). This helps you on two fronts: 1) logging of parameters and 2)
I live on our Munin graphs :-)
- d.
On Oct 30, 2012, at 12:12 PM, David Gerard dgerard@gmail.com wrote:
....
I live on our Munin graphs :-)
I know it's heretically commercial software, but I believe in and have used New Relic at several client sites, and their free "lite" service level might well help diagnose further...
George William Herbert Sent from my iPhone
What heresy is present? One uses the tools to do the job. If the open source tools aren't up to the job, one goes to other sources, of course.
On Oct 30, 2012, at 6:28 PM, George Herbert wrote:
....
I know it's heretically commercial software, but I believe in and have used New Relic at several client sites, and their free "lite" service level might well help diagnose further...
George William Herbert
Sent from my iPhone
_______________________________________________
MediaWiki-l mailing list
MediaWiki-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/mediawiki-l