Well, things seem more or less normal again now. There is no definitive answer as to what was going on, but it seems likely to have had something to do with the APC cache.
Brion has a very safe-looking script to clear the APC cache, and I see nothing wrong with running it. But it seemed "stuck", and the site was very unhappy; when he kill -9'd it, the site seemed better.
But at roughly the same time, he disabled APC caching completely, which probably costs us some performance.
So, we're running again, but I'm not at all confident that we understand the problem.
wikitech-l@lists.wikimedia.org