[Mediawiki-l] possible revision comparison optimization with diff3?

sebastien bracq sbracq at hotmail.fr
Sat Feb 28 22:45:46 UTC 2009




> From: thelastguardian at hotmail.com
> To: mediawiki-l at lists.wikimedia.org
> Date: Sat, 28 Feb 2009 12:21:13 -0800
> CC: thelastguardian at hotmail.com
> Subject: [Mediawiki-l] possible revision comparison optimization with diff3?
> 
> Hello, I run a fairly busy wiki, and I have been having trouble with its
> CPU load lately, with load jumping as high as 140 at noon (not 1.4, not
> 14, but ~140). Obviously this brought the site to a crawl. After some
> investigation I found the cause: multiple diff3 comparisons were being
> run at the same time.
> 
>  
> 
> Explaining the cause requires a little background. The wiki I run deals
> with editing large text files; pages with hundreds of kilobytes of plain
> text are common. Normally my servers can handle the edit requests for
> these pages without trouble.
> 
>  
> 
> However, search bots and crawl bots (from both search engines and
> individual users) have been hitting my wiki pretty hard lately. Each of
> these bots tries to copy every page, including the revision history of
> each of these 100 kb wiki text pages. Since each page can have hundreds
> of edits, every large text file triggers hundreds of revision-history
> diffs (lighttpd/apache -> php5 -> diff3?).
> 
>  
> 
> I have done some testing on my servers, and I found that each diff3
> comparison of a typical large text page adds roughly 3 to the CPU load.
> 
>  
> 
> Right now I have implemented a few temporary restrictions:
> 
> 1. Limit the number of connections per IP
> 
> 2. Disallow all search bots
> 
> 3. Increase the RAM limit in the PHP config file
> 
> 4. Use memcached wherever possible (not all servers have memcached); a
>    rough sketch of 3. and 4. follows below
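> 
> By 3. and 4. I mean something along these lines in LocalSettings.php (a
> rough sketch only; the memcached address and the memory limit are
> placeholders for whatever your servers actually use):
> 
>     // 3. Raise the per-request PHP memory ceiling.
>     ini_set( 'memory_limit', '128M' );
> 
>     // 4. Point MediaWiki's object and parser caches at memcached so
>     //    repeated work (parses, messages, etc.) can be reused.
>     $wgMainCacheType    = CACHE_MEMCACHED;
>     $wgParserCacheType  = CACHE_MEMCACHED;
>     $wgMemCachedServers = array( '127.0.0.1:11211' );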
> 
>  
> 
> I have some problems with 1. and 2. First of all, 1. doesn't really solve
> the load problem: the slowdown can still occur if multiple bots hit the
> site at the same time.
> 
> 2. has a similar problem. After I edited my robots.txt, I discovered that
> some clowns are simply ignoring it. Also, only Google supports wildcard
> patterns in robots.txt, so I can't just use Disallow: *diff=* .
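> 
> Since some bots ignore robots.txt anyway, I have been wondering whether a
> crude application-level guard near the top of LocalSettings.php would
> help, something like the following (only a sketch; the user-agent pattern
> is a guess and would need tuning, and it punishes well-behaved bots too):
> 
>     // Refuse diff views to clients that identify themselves as crawlers.
>     if ( isset( $_GET['diff'] ) && isset( $_SERVER['HTTP_USER_AGENT'] ) &&
>         preg_match( '/bot|crawler|spider|slurp/i',
>             $_SERVER['HTTP_USER_AGENT'] ) ) {
>         header( 'HTTP/1.1 403 Forbidden' );
>         echo 'Revision comparisons are not available to crawlers.';
>         exit;
>     }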
> 
>  
> 
> I don't want to split these large text pages up, because that would make
> it harder for my scripts to assemble the full text from the database
> directly.
> 
>  
> 
>  
> 
> So I am turning my attention to system-level optimization. Does anyone
> have experience tuning diff3? For example, switching to libxdiff? Or
> renicing the FastCGI processes (I use lighttpd)? Or is it possible to
> disable revision comparison altogether for pages older than a certain
> age?
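> 
> On the libxdiff / alternative diff idea, what I have in mind is roughly
> the following (a sketch only; it assumes the wikidiff2 PHP extension or
> the PECL xdiff extension is actually installed, and $oldText / $newText
> are just placeholders for two revision texts):
> 
>     // LocalSettings.php: let MediaWiki use the wikidiff2 extension for
>     // history comparisons instead of doing the diff in pure PHP.
>     $wgExternalDiffEngine = 'wikidiff2';
> 
>     // Or, with the PECL xdiff extension (a libxdiff binding), a unified
>     // diff of two revision texts can be produced like this:
>     $patch = xdiff_string_diff( $oldText, $newText, 3 );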
> 
>  
> 
> Thanks for the help
> 
>  
> 
> Tim
> 
> _______________________________________________
> MediaWiki-l mailing list
> MediaWiki-l at lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/mediawiki-l

Maybe these bot accesses are normal, an effect of the formation of wikipedia2 and of another project I work on, www.wikilogos.org? Sorry. Was I the creator of Wikipedia? Maybe, after all, why not...


