On Wed, Oct 26, 2011 at 1:46 PM, Daniel Barrett <danb(a)vistaprint.com> wrote:
[snip]
> The problems I am seeing are:
> * Sometimes an individual "Save Page" operation will sit for 20-30
> seconds before completing.
> * Occasionally some pages take a long time to render (10-15 seconds)
> for no discernible reason. (This is not due to the live SQL queries
> mentioned above.)
> I'd like to eliminate these delays and decrease page rendering time to 0.1
> second or less.
> I have determined that our extensions do not slow the wiki down much. After
> removing all of them, the speed stays about the same.
> Given our architecture, what's the best next step we should investigate to
> improve performance?
A few quick comments on each of these:
> * File cache or Squid? (And is there some easy way to tell Squid to
> exclude our dynamic SQL pages? They all run a particular wiki extension, so
> if there's something programmatic we can do in the extension, that's great.
> I am not very familiar with Squid.)
The file cache mode probably isn't as well tested as it used to be, but
setting up squid or varnish can be a bit intimidating.
As long as your extension disables caching on the pages that use it, I think
either should work fine on the other pages and safely let things through.
Note though that both methods will only cache page views for users who aren't
logged in; once someone has a login session going, their requests won't be
served from that cache. If much of your traffic is from logged-in users, it
may not help much.
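If the extension disabling caching is the open question, one way to do it is to mark the page uncacheable from the extension's parser hook. A minimal sketch -- the hook function and the inner rendering helper are hypothetical names standing in for whatever your extension already does:

```php
<?php
// Hypothetical tag-hook callback for the extension that runs
// the live SQL queries. Disabling the parser cache here also
// prevents the page from being stored by Squid/Varnish, since
// MediaWiki will send it out as uncacheable.
function renderLiveSqlTag( $input, array $args, Parser $parser ) {
    // Never cache pages containing this tag; always re-render.
    $parser->disableCache();

    // renderLiveSqlResults() is a placeholder for your existing
    // query-and-format logic.
    return renderLiveSqlResults( $input );
}
```

With that in place, a front-end cache can safely sit in front of the whole wiki: the dynamic pages opt themselves out, everything else gets cached.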
> * memcached?
I would strongly recommend using memcached in place of eAccelerator for data
caching for the following reasons:
* it's a prerequisite for adding more web servers (eAccel's cache is not
shared between servers)
* this is required to maintain consistency if you also have to run
command-line maintenance scripts (eAccel's cache is not shared between web
and CLI)
* you can make the cache size as large as you like, including sharding over
multiple machines!
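The switch itself is a small LocalSettings.php change; a sketch, with placeholder host:port values for wherever your memcached daemons actually run:

```php
<?php
// LocalSettings.php: use memcached as the shared object cache
// instead of eAccelerator's per-server local cache.
$wgMainCacheType = CACHE_MEMCACHED;
$wgMemCachedServers = array(
    '127.0.0.1:11211',    // placeholder; add more 'host:port'
    // '10.0.0.2:11211',  // entries to shard the cache across
    // '10.0.0.3:11211',  // multiple machines
);
```

Every web server and every command-line script pointed at the same server list then shares one consistent cache.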
> * Increase number of CPUs?
More CPUs means the web server can process more web requests in parallel,
and can definitely be a win under load. In a VM environment this should be
easy to try -- bump up the CPU count, reboot, and see how it changes your
performance profile under load.
> * Multiple front-end servers?
In the Apache+PHP web execution model, multiple front-end servers and more
CPUs on the first front-end server are very similar.
If the VMs will all run on the same physical machine then there may not be
much benefit against just assigning more CPU and RAM to the first VM. But if
you need more VMs to spread them over multiple physical hosts, that's the
way to go.
See the notes above about switching from eAccelerator to memcached; you'd
also need to make sure that PHP's session storage is shared between all
front-end servers. (If it's not, you can use the $wgSessionsInMemcached
switch to stick them into the shared cache space.)
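The shared-session part is a one-line toggle once a shared memcached pool is configured as the main cache; a sketch:

```php
<?php
// LocalSettings.php: keep login sessions in the shared memcached
// pool (rather than each server's local filesystem) so that any
// front-end server can handle any logged-in user's request.
$wgSessionsInMemcached = true;
```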
> * Change from a VM to a physical machine?
This may well help; I'd recommend prepping a test by adding another frontend
as a VM, and then swap in the same configuration on a real machine and see
how they behave.
> * Move mySQL to a separate server (possibly physical)?
Potentially also a help; MySQL will have its own needs for RAM, disk access
etc which may not fit well with the virtualization. This is usually a
service that's very easy to try moving out since it's reasonably isolated;
you only need to move it out and point LocalSettings.php at the
new IP.
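Concretely, that's one setting; the hostname below is a placeholder for wherever the new database server ends up:

```php
<?php
// LocalSettings.php: point the wiki at the relocated MySQL server.
// 'db1.internal' is a placeholder for your new host or IP.
$wgDBserver = 'db1.internal';
```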
> * Something else?
A few things off the top of my head:
* definitely keep using eAccelerator or APC or another PHP opcode cache:
this is key to getting decent basic large-PHP-app performance
* double-check for slow queries on the database; MySQL has logging options
that can help.
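On the MediaWiki side you can also dump every query the wiki issues to a debug log, which helps match slow entries in MySQL's log back to specific page actions. A sketch -- the log path is a placeholder, and this is too verbose to leave enabled in production:

```php
<?php
// LocalSettings.php: log all SQL statements MediaWiki issues,
// with timing info, to the debug log. Turn off when done --
// the log grows quickly.
$wgDebugLogFile = '/tmp/wiki-debug.log'; // placeholder path
$wgDebugDumpSql = true;
```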
* if using PHP's default filesystem storage for sessions, consider also
trying the memcached sessions option -- default filesystem sessions perform
locking serialization which prevents multiple requests from the _same user_
from running at once, which could cause slowness in a few scenarios. Then
again it might not hit you at all. :)
* on slow page saves etc.: check whether that time is being spent in the page
itself or in additional cleanup jobs running during the request. You could
try deferring the job queue so cleanup tasks (such as invalidating caches and
rebuilding links on pages that use a template you've edited) run separately:
http://www.mediawiki.org/wiki/Manual:Job_queue
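Deferring those jobs out of the save request is another small config change plus a cron entry; a sketch, with placeholder schedule and install path:

```php
<?php
// LocalSettings.php: don't run queued jobs during web requests;
// save-time work stays minimal and a cron'd maintenance script
// drains the queue instead, e.g. (placeholder path/schedule):
//   */5 * * * * php /var/www/wiki/maintenance/runJobs.php
$wgJobRunRate = 0;
```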
* When making performance measurements, look at outliers -- it's often not
the average case, but the *horrible horrible 10000x worse worst case* that
kills you. :)
There may also be other things like localization recaching that are
expensive. A larger memcache space might help with those, not sure.
-- brion