---------- Forwarded message ----------
From: Brent Palmer <bop@brentopalmer.com>
Date: 2010/2/2
Subject: [Mediawiki-l] Wikipedia mirror speedup
To: mediawiki-l@lists.wikimedia.org


     Hi,

   We are creating an off-line English Wikipedia mirror. By off-line
   I mean that it is not reachable from the Internet; it is served
   only on a LAN that has no Internet connection. It will be deployed
   in locations with little or no Internet access (schools and
   universities in Africa, for example). More info:
   http://en.wikipedia.org/wiki/EGranary_Digital_Library

   The machines hosting the mirror are often lower-end boxes without
   much spare memory, and we are tight on disk space. Generally there
   won't be a lot of traffic, though, and the mirror is read-only.
   The main problem is that it is incredibly slow (even on our
   relatively fast server): articles like Abraham_Lincoln can take
   several minutes to render, and the longer the article and the more
   templates it uses, the longer it takes. I know this is a common
   question for MediaWiki installations, and Wikipedia itself relies
   on several layers of caching. Because of the constraints above,
   Squid and the file cache don't really seem like viable options,
   and we tried PHP accelerators without much benefit. Looking at the
   profiler log, most of the time is spent in the parser. Pages speed
   up considerably after the initial access when we have the parser
   cache set to CACHE_DB.
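
   For reference, this is roughly what we mean in LocalSettings.php
   (assuming $wgParserCacheExpireTime is available in 1.14; the
   expiry value below is just an illustration of "effectively
   never"):

       // Store parsed pages in the objectcache table.
       $wgParserCacheType = CACHE_DB;
       // On a read-only mirror the cached HTML never goes stale, so
       // a very long expiry is safe; one year is an example value.
       $wgParserCacheExpireTime = 365 * 24 * 3600;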

   The plan:
   We want to deliver Wikipedia in a form that is as fast as possible
   for the end user. The plan is to pre-cache all the articles: set
   the parser cache never to expire and request every article once to
   populate it before we deploy the mirror (a rough warm-up script is
   sketched below). The assumption is that this will ultimately take
   less disk space than creating a static copy of every page, or even
   a huge file cache (I think there is a maintenance script that can
   build one). It would also solve the problem of generating all the
   thumbnails ahead of time so that we don't have to do it on the
   fly. (We currently rewrite all thumbnail requests to the original
   image and let the browser scale it down -- this also makes the
   mirror feel slow.)
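
   To make the warm-up idea concrete, something along these lines is
   what we have in mind -- a rough sketch only; the base URL, the
   database credentials, and the lack of a table prefix are all
   placeholders for our actual setup:

       <?php
       // warm-cache.php -- request every article once so the parser
       // cache entries (and, while rendering, the thumbnails) are
       // generated before the mirror is deployed.
       $base = 'http://localhost/index.php?title=';
       $db   = new mysqli('localhost', 'wikiuser', 'secret', 'wikidb');

       // Namespace 0 = articles; page_title already uses underscores.
       // MYSQLI_USE_RESULT streams rows instead of buffering millions
       // of titles in memory.
       $res = $db->query(
           'SELECT page_title FROM page WHERE page_namespace = 0',
           MYSQLI_USE_RESULT);
       while ($row = $res->fetch_assoc()) {
           $ch = curl_init($base . urlencode($row['page_title']));
           curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // discard HTML
           curl_setopt($ch, CURLOPT_TIMEOUT, 600); // first parse is slow
           curl_exec($ch);
           curl_close($ch);
       }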

   Are our assumptions correct? Is there a better way to go about
   this? Are there options we haven't thought of? Any comments on how
   to approach it would be welcome.

   Thanks to donations, we have all the machinery, storage, time, and
   processing power to do any amount of pre-processing of the
   Wikipedia assets -- we're looking for things we can do in advance
   to make it run fast on end-user machines. For the pre-processing
   we plan to set up multiple machines, each running the wiki against
   a single shared database, with several clients requesting articles
   simultaneously. I don't know whether it will be feasible to have
   all the machines share the same filespace for thumbnail generation
   (NFS/SMB), but I might try that too.
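
   If the shared filespace works out, pointing each front-end at the
   same upload directory should be enough -- a minimal sketch, with
   the mount point made up for illustration:

       // In each machine's LocalSettings.php; /mnt/shared/images is
       // just an example mount point for the NFS/SMB share.
       $wgUploadDirectory = '/mnt/shared/images';
       $wgUploadPath      = '/images'; // URL path the web server maps to it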

   Thanks in advance for your thoughts.
   Brent
   Widernet.org

   MediaWiki 1.14.0
   PHP 5.2.8 (apache2handler)
   MySQL 5.0.41-community-nt-log
   Apache 2.1
   OS: various


_______________________________________________
MediaWiki-l mailing list
MediaWiki-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/mediawiki-l



--
{+}Nevinho
Come join the Collaborative Movement http://sextapoetica.com.br !!