On 18/10/12 09:25, David Gerard wrote:
> Whenever an article hits Reddit,[1] the server suffers under the load. Typically it goes into swap and thrashes itself to death. If we're really lucky the oom-killer comes out to play and shoots things randomly (usually Apache, maybe Lucene). The fun bit: sometimes it does this for no visible reason, just tips over into swap and promptly stops talking to the world (my shell session still works slowly).
It's funny how everyone is telling you how to use less CPU when your problem is actually memory.
I think you should switch everything to FastCGI, and use a single FastCGI process pool for all wikis. Reduce the maximum number of FastCGI workers severely, until the PHP memory_limit multiplied by the maximum worker count is less than the amount of memory you have available for PHP (i.e. physical RAM minus memcached, Lucene, etc.).
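As a rough sketch of that sizing rule (the function name and the example figures are hypothetical, not taken from your setup), the worker limit is just the memory left over for PHP divided by memory_limit, rounded down:

```python
def max_fastcgi_workers(physical_ram_mb, reserved_mb, memory_limit_mb):
    """Largest worker count whose worst-case PHP usage fits in the RAM left for PHP.

    physical_ram_mb: total RAM on the box
    reserved_mb:     RAM set aside for memcached, Lucene, the OS, etc. (an estimate)
    memory_limit_mb: PHP memory_limit per worker
    """
    if memory_limit_mb <= 0:
        raise ValueError("memory_limit_mb must be positive")
    available_mb = physical_ram_mb - reserved_mb
    # Floor division: workers * memory_limit never exceeds the available memory.
    return max(available_mb // memory_limit_mb, 0)


# Hypothetical figures: 4 GB of RAM, 1.5 GB reserved, memory_limit = 128M.
workers = max_fastcgi_workers(4096, 1536, 128)
```

If the result lands exactly on the boundary, drop it by one or two to leave some headroom.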
The point of this is to decouple Apache's MaxClients from the maximum memory usage. It's essential to have a high MaxClients on an Apache installation that's directly serving remote users, because Apache will have a lot of threads just waiting around for communication with the remote users to complete, even if you disable keepalive.
With FastCGI, you can have a tiny PHP process pool, and in the event of high load, client connections will politely queue in Apache waiting for a FastCGI slot, instead of all trying to run PHP at once and sending your server into swap death.
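To illustrate the queueing behaviour, here is a toy simulation, not a model of Apache or any particular FastCGI implementation. A semaphore stands in for the worker pool, threads stand in for client connections, and all names and numbers are made up for the example:

```python
import threading
import time


def simulate(n_requests, pool_size, work_seconds=0.01):
    """Run n_requests concurrent requests through pool_size worker slots.

    Returns the peak number of requests that were executing at once.
    """
    slots = threading.BoundedSemaphore(pool_size)
    lock = threading.Lock()
    state = {"running": 0, "peak": 0}

    def handle_request():
        with slots:  # waits here, queued, until a worker slot is free
            with lock:
                state["running"] += 1
                state["peak"] = max(state["peak"], state["running"])
            time.sleep(work_seconds)  # stand-in for PHP doing the work
            with lock:
                state["running"] -= 1

    threads = [threading.Thread(target=handle_request) for _ in range(n_requests)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state["peak"]


# Ten simultaneous clients, two worker slots.
peak = simulate(n_requests=10, pool_size=2)
```

However many clients arrive at once, the number of requests actually running never exceeds the pool size, so peak memory stays bounded; the rest simply wait their turn.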
Once you've done that, you should then disable swap. I am generally anti-swap -- having swap means that when the server runs out of memory, the whole server becomes unresponsive, often requiring a power-cycle, instead of a single process being killed. But it's especially bad to use swap on Linode, where I/O can be so slow that even light swapping can cause the server to be unresponsive.
You can use /proc/[pid]/oom_adj to reduce the chance of oom-killer killing Lucene or some other useful process. oom-killer is weird and buggy and sometimes just does its own thing, but you may as well at least try to teach it some manners. Android uses oom_adj to control memory usage on phones, and it seems to work for them.
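A minimal sketch of setting it from a script, assuming a kernel that still exposes oom_adj (newer kernels prefer /proc/[pid]/oom_score_adj, which uses a -1000 to 1000 scale instead). The helper name and the proc_root parameter are my own invention, the latter only there so it can be tried against a scratch directory. Lowering the value normally requires root.

```python
import os

OOM_ADJ_MIN = -17  # -17 exempts the process from the oom-killer entirely
OOM_ADJ_MAX = 15   # 15 makes the process the preferred victim


def set_oom_adj(pid, value, proc_root="/proc"):
    """Write an oom_adj value for the given pid and return the path written."""
    if not OOM_ADJ_MIN <= value <= OOM_ADJ_MAX:
        raise ValueError("oom_adj must be between -17 and 15")
    path = os.path.join(proc_root, str(pid), "oom_adj")
    with open(path, "w") as f:
        f.write(str(value))
    return path
```

For example, set_oom_adj(lucene_pid, -10) makes Lucene a much less attractive target, where lucene_pid is whatever PID your Lucene daemon is running under. The setting is per-process and is lost on restart, so it belongs in the init script.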
An interesting thing about FastCGI is that you can run the workers in a chroot. If you have 4GB of memory, then I guess you are using a 64-bit Linux distribution. In the worst case, a 64-bit architecture will have double the memory usage of a 32-bit architecture, due to pointer sizes. It turns out that some things MW does are not very far away from that worst case. The schroot "personality" parameter makes it easy to install a chroot environment for PHP which uses 32-bit binaries on a 64-bit host.
If you reduce MW's typical memory usage to, say, 2/3 of its current value, then you can reduce the memory_limit by the same factor, which implies that you can increase the FastCGI process pool size by 50% for a corresponding increase in maximum throughput, assuming CPU is not maxed out.
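The arithmetic, spelled out as a sketch (the starting pool size of 20 is a hypothetical figure): with a fixed memory budget, scaling per-worker memory by a factor f scales the number of workers that fit by 1/f.

```python
from fractions import Fraction


def scaled_pool_size(current_workers, memory_factor):
    """Workers that fit in the same memory budget after per-worker memory
    is multiplied by memory_factor (e.g. Fraction(2, 3))."""
    memory_factor = Fraction(memory_factor)
    if memory_factor <= 0:
        raise ValueError("memory_factor must be positive")
    # Same budget, smaller workers: the count grows by 1 / memory_factor.
    return int(Fraction(current_workers) / memory_factor)


# Per-worker memory drops to 2/3, so the pool grows by a factor of 3/2.
new_size = scaled_pool_size(20, Fraction(2, 3))
```

Fractions are used rather than floats so that 2/3 does not pick up a rounding error before the result is truncated to a whole number of workers.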
-- Tim Starling