Hello, I run a fairly busy wiki, and lately I have been having trouble with
its CPU load, which jumps as high as 140 at noon (not 1.4, not 14, but
~140). Obviously this slows the site to a crawl. After investigating, I
found the cause: multiple diff3 comparisons were being called at the same
time.
Explaining the cause needs a little background. The wiki I run deals with
editing large text files. It is common to see pages with hundreds of KB of
pure text. Normally my servers can handle the edit requests for these
pages.
However, it seems that search bots and crawlers (from both search engines
and individual users) have been hitting my wiki pretty hard lately. Each of
these bots tries to copy all the pages, and this includes the Revision
History of each of these 100 KB wiki text pages. Since each page can have
hundreds of edits, every single large text page can spawn hundreds of
revision history diffs (lighttpd/apache -> php5 -> diff3?).
I have done some testing on my servers, and I found that each diff3
comparison of a typical large text page raises the CPU load by about 3.
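
As a rough back-of-the-envelope check, the two figures above can be combined
to estimate how many comparisons were running at the peak. This assumes the
load adds up linearly per comparison, which is only an approximation; the
variable names are made up for illustration.

```python
# Rough estimate only: assumes each diff3 comparison adds the same amount
# of load and that the contributions simply add up.
LOAD_PER_DIFF = 3   # load increase per diff3 comparison, from the testing above
PEAK_LOAD = 140     # peak load observed at noon

concurrent_diffs = PEAK_LOAD / LOAD_PER_DIFF
print(f"roughly {concurrent_diffs:.0f} diff3 comparisons running at once")
```

So a few dozen simultaneous history or diff requests would be enough to
explain the peak.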
Right now I have implemented a few temporary restrictions:
1. Limit the number of connections per IP
2. Disallow all search bots
3. Increase the RAM limit in the PHP config file
4. Memcache wherever possible (not all servers have memcache)
I have some problems with 1. and 2. First of all, 1. doesn't really solve
the load problem. The slowdown can still occur if multiple bots hit the
site at the same time.
2. faces a similar problem. After I edited my robots.txt, I discovered that
some clowns are ignoring it. Also, wildcard patterns in robots.txt are an
extension that only some crawlers (such as Google's) honor, so I can't rely
on Disallow: *diff=* .
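
Since robots.txt cannot be relied on for this, one option is to do the same
match on the server side. The following is only a sketch of the matching
logic, not something taken from MediaWiki or lighttpd: the function names
and the BOT_MARKERS list are hypothetical, and it assumes diff and history
requests can be recognized by the diff= and action=history query parameters.
A real deployment would put the equivalent check in the web server config or
in PHP.

```python
from urllib.parse import urlsplit, parse_qs

# Hypothetical user-agent substrings; adjust to whatever shows up in the logs.
BOT_MARKERS = ("bot", "crawl", "spider", "slurp")


def is_expensive_request(url: str) -> bool:
    """True if the URL asks for a diff or a revision history page."""
    query = parse_qs(urlsplit(url).query, keep_blank_values=True)
    return "diff" in query or "history" in query.get("action", [])


def looks_like_bot(user_agent: str) -> bool:
    """Crude check: does the user agent contain a known bot marker?"""
    ua = user_agent.lower()
    return any(marker in ua for marker in BOT_MARKERS)


def should_block(url: str, user_agent: str) -> bool:
    """Block only the expensive pages, and only for apparent bots."""
    return is_expensive_request(url) and looks_like_bot(user_agent)
```

Bots that ignore robots.txt may also send a fake user agent, so this would
only catch the ones that identify themselves.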
I don't want to break these large text pages up, because that would make it
harder for scripts to compile the scripts together directly from the
database.
So I am turning my attention to system-level optimization. Does anyone have
any experience with tuning diff3? For example, switching to libxdiff? Or
renicing the FastCGI processes? (I use lighttpd.) Or is it possible to
disable Revision Comparison altogether for pages older than a certain age?
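
To illustrate what the renice idea amounts to, here is a minimal sketch that
starts a child process at a lower scheduling priority. It assumes a Unix
system and that the comparison runs as an external process; run_niced is a
hypothetical helper, not part of the wiki software. Lowering the priority
does not reduce the total CPU work, it only lets other processes go first.

```python
import os
import subprocess
import sys


def run_niced(cmd, increment=10):
    """Run cmd as a child process with its niceness raised by `increment`.

    Unix only. The parent process keeps its own priority; only the child
    is deprioritized, because os.nice() runs after fork and before exec.
    """
    return subprocess.run(
        cmd,
        capture_output=True,
        text=True,
        preexec_fn=lambda: os.nice(increment),
    )


if __name__ == "__main__":
    # Demo: the child reports its own niceness (os.nice(0) changes nothing).
    result = run_niced([sys.executable, "-c", "import os; print(os.nice(0))"])
    print("child niceness:", result.stdout.strip())
```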
Thanks for the help
Tim