Hi all,
*TL;DR - How can I determine what Apache processes are waiting on when
they occasionally get stuck servicing MediaWiki requests (often leading to
502s once MaxClients is hit)?*
I'm dealing with a problem where occasionally one or more of my wiki
servers hits its Apache limit of 100 connections (MaxClients, calculated
from total server memory and per-process Apache memory usage). Sometimes it
clears up on its own, but often it does not. This is on Ubuntu 12.04. I
often see Apache processes stuck in the "sending" state on the
/server-status page, but the only clue to what they are waiting on is the
Request column, and I can only load that page when Apache is responsive,
which it usually isn't once it reaches MaxClients. Other times I see a
stuck process even though Apache is below MaxClients, and again I cannot
tell what it is waiting on. My hunch is that image-heavy pages with tons of
thumbnails are responsible, but page content is controlled by my wiki
community (I'm not a MediaWiki editor), and the hunch doesn't help me
determine what the processes are actually stuck on.
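To make the server-status observation concrete, here is a minimal Python
sketch of how worker states could be tallied from the machine-readable
status output. It assumes mod_status is enabled and that the text of
/server-status?auto has already been fetched; the function name and the
sample text are illustrative, not taken from the setup described above.

```python
from collections import Counter

# Scoreboard key characters as documented for Apache's mod_status.
SCOREBOARD_KEYS = {
    "_": "waiting for connection",
    "S": "starting up",
    "R": "reading request",
    "W": "sending reply",
    "K": "keepalive (read)",
    "D": "DNS lookup",
    "C": "closing connection",
    "L": "logging",
    "G": "gracefully finishing",
    "I": "idle cleanup of worker",
    ".": "open slot with no current process",
}


def count_worker_states(auto_status_text):
    """Count workers per state from the text of /server-status?auto."""
    for line in auto_status_text.splitlines():
        if line.startswith("Scoreboard:"):
            board = line.split(":", 1)[1].strip()
            counts = Counter(board)
            # Fall back to the raw character for any key not in the table.
            return dict((SCOREBOARD_KEYS.get(k, k), n) for k, n in counts.items())
    raise ValueError("no Scoreboard line found")


# Hypothetical sample input, shaped like the ?auto output.
sample = "BusyWorkers: 3\nIdleWorkers: 2\nScoreboard: WW_K_...\n"
states = count_worker_states(sample)
```

Polling something like http://127.0.0.1:8080/server-status?auto at a short
interval and logging these counts might capture the build-up before Apache
reaches MaxClients and stops answering. The counts alone do not say which
request a worker is stuck on; that still needs the Request column, which
mod_status only shows with ExtendedStatus On.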
I've tried strace, lsof, pstack, reading /proc/$pid/stack directly, the
Apache logs, etc., but none of that has shown me why some processes hang,
crowding out new ones and often leading to 502s. I have four load-balanced
web servers. Sometimes all four hit MaxClients, leading to steady 502s;
other times only some of them do, leading to broken-looking pages and/or
intermittent 502s.
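As a complement to the tools listed above, the following Python sketch
shows one way to snapshot a suspect process from /proc on Linux: the
process state from /proc/<pid>/status and the kernel wait channel from
/proc/<pid>/wchan. The helper names are hypothetical, and whether wchan is
readable or informative on a given kernel is an assumption.

```python
def parse_proc_status(status_text):
    """Return selected fields from the text of /proc/<pid>/status."""
    wanted = set(["Name", "State", "VmRSS", "Threads"])
    fields = {}
    for line in status_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key in wanted:
            fields[key] = value.strip()
    return fields


def read_wchan(pid):
    """Return the kernel symbol the process sleeps in, or None if unavailable."""
    try:
        with open("/proc/%d/wchan" % pid) as fh:
            return fh.read().strip() or None
    except (IOError, OSError):
        return None


def snapshot(pid):
    """Combine the parsed status fields and the wait channel for one PID."""
    with open("/proc/%d/status" % pid) as fh:
        info = parse_proc_status(fh.read())
    info["wchan"] = read_wchan(pid)
    return info


# Hypothetical sample input, shaped like /proc/<pid>/status.
sample = "Name:\tapache2\nState:\tD (disk sleep)\nVmRSS:\t   51234 kB\n"
parsed = parse_proc_status(sample)
```

Calling snapshot() on the PIDs that server-status reports as stuck, a few
seconds apart, would show whether they sit in state D (uninterruptible
sleep, usually blocked I/O) or state S (sleeping, for example on a socket),
which narrows down where to look next.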
Architectural considerations:
* Each web server runs Varnish on port 80 with Apache 2.2 (using APC)
hosting several MediaWiki 1.24.2 wikis as name-based vhosts on
127.0.0.1:8080 as the Varnish backend (and all four Varnishes in
$wgMemCachedServers). Varnish connections tend to remain steady in their
usual patterns while the Apaches spike to MaxClients, so it's not an
unusual spike in Internet traffic to the wikis.
* Six wikis are configured as vhosts in Apache, load balanced by a separate
set of front-end servers. Two of the wikis are for private internal use and
the other four are public, though the traffic to one of the public wikis
dwarfs the rest, and it's the wiki giving me problems.
* The upload directory is a symlink into an NFS-mounted filesystem with a
subdirectory per wiki, e.g. $IP/images is a symlink to
/var/www/images/$wikiname, where /var/www/images is the NFS mount. I never
see NFS issues and the NFS server's Graphite dashboard shows the server to
be *very* lightly loaded.
* Apache talks to a separate beefy MySQL server and two dedicated Memcached
servers for session and query caching. There are four MySQL instances on
the server, via mysqld_multi, but the problem wiki's database has its own
dedicated instance, so I am able to separate out database traffic from the
rest of the wikis (and another web application that uses its own dedicated
instance) and manage and tune the wiki's database instance independently.
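Since the upload directory lives on NFS, one way to check the mount from
the web server's side (rather than from the NFS server's dashboard) is to
time metadata calls against it. This is a rough sketch under that
assumption; the function names and the 200-entry limit are arbitrary
choices for illustration.

```python
import os
import time


def time_stat_calls(directory, limit=200):
    """Time os.stat() on up to `limit` entries; return a list of seconds."""
    timings = []
    for name in sorted(os.listdir(directory))[:limit]:
        path = os.path.join(directory, name)
        start = time.time()
        try:
            os.stat(path)
        except OSError:
            # Entry vanished or is unreadable; skip it.
            continue
        timings.append(time.time() - start)
    return timings


def summarize(timings):
    """Return the count, maximum, and mean of a list of timings."""
    if not timings:
        return {"count": 0, "max": 0.0, "mean": 0.0}
    return {
        "count": len(timings),
        "max": max(timings),
        "mean": sum(timings) / len(timings),
    }
```

Running summarize(time_stat_calls("/var/www/images/$wikiname")) with the
real wiki name substituted, while Apache is at MaxClients, and comparing it
against a quiet period would show whether the mount is slow from the
client's point of view even when the NFS server itself looks idle.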
Right now I'm mainly looking for ways to troubleshoot the stuck processes,
but any advice on this architecture is also welcome, as I feel it could use
some improvement but I'm not sure how just yet.
Justin