Hi,
We are creating an off-line English Wikipedia mirror. By off-line, I only mean that it is not available from the Internet; only from a LAN which is not connected to the Internet. This will be deployed in locations where there is little or no Internet access (schools and universities in Africa for example). More info: http://en.wikipedia.org/wiki/EGranary_Digital_Library
The machines hosting the mirror are often lower-end machines without much spare memory, and we are tight on disk space. Generally there won't be a lot of traffic, though. Also, the Wikipedia mirror is read-only. The main problem is that it is incredibly slow (even on our relatively fast server). Articles like Abraham_Lincoln can take several minutes to load. The longer the article and the more templates it uses, the longer it takes. I know that this is a common problem that people ask about for MediaWiki installations, and that Wikipedia uses various levels of caching. Because of the above constraints, neither Squid nor the file cache really seems like a viable option. We tried PHP accelerators without much benefit. Looking at the profiler log, it seems that most of the time is spent in the parser. Pages speed up considerably after the initial access when we have the parser cache set to CACHE_DB.
The plan: We want to deliver the Wikipedia in a form that makes it as fast as possible for the end user. The plan is to try to pre-cache all the articles. In other words, set the parser cache not to expire at all and hit every article once to populate the parser cache before we deploy the mirror. The assumption is that this will ultimately take less disk space than creating a static copy of all the pages, or even creating a huge file cache (I think there is a maintenance script that allows you to do this). This will also solve the problem of generating all the thumbnails ahead of time so that we don't have to do it on the fly. (We currently rewrite all requests for thumbnails to the original image and let the browser resize it; this also makes the mirror appear to run slowly.)
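To make the warm-up pass concrete, here is a minimal sketch of what we have in mind, not a tested tool. It assumes a plain list of article titles and a mirror reachable at a hypothetical BASE_URL; the function names (article_url, fetch, warm_cache) are ours for illustration and are not part of MediaWiki.

```python
import urllib.parse
import urllib.request

# Assumption: entry point of the local mirror; adjust to the real install.
BASE_URL = "http://localhost/wiki/index.php"


def article_url(title, base_url=BASE_URL):
    """Build the view URL for one article title (spaces become underscores)."""
    query = urllib.parse.urlencode({"title": title.replace(" ", "_")})
    return base_url + "?" + query


def fetch(url, timeout=600):
    """Request one page and return the HTTP status; the body is discarded.

    The long timeout reflects that an uncached article can take minutes.
    """
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        resp.read()
        return resp.status


def warm_cache(titles, fetch_fn=fetch, base_url=BASE_URL):
    """Request every title once and return the titles that failed."""
    failed = []
    for title in titles:
        title = title.strip()
        if not title:
            continue  # skip blank lines in the title list
        try:
            if fetch_fn(article_url(title, base_url)) != 200:
                failed.append(title)
        except OSError:  # covers URLError, HTTPError and socket timeouts
            failed.append(title)
    return failed
```

The returned list of failed titles could be fed back in for a second pass, so that a timeout on one long article does not leave a hole in the cache.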
Any comments? Are our assumptions correct? Is there another way to go about this, or are there options we haven't thought of? Any suggestions on how to go about it?
Thanks to donations, we have all the machinery, storage, time, and processing power to do any pre-processing of Wikipedia assets -- we're seeking any ideas of things we can do in advance to make it run fast on end-user machines. Generally we plan on setting up multiple machines with the Wikipedia running on each one, all accessing one database and running multiple clients requesting articles simultaneously. I don't know if it will be feasible to have all the machines share the same filespace for thumbnail generation (NFS/SMB) but I might try that too.
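As a rough sketch of the multi-machine, multi-client setup described above: each machine could take its own slice of the title list and request those articles from several threads at once. The names shard and warm_parallel are hypothetical, and fetch_fn stands for whatever callable actually requests one article and returns a true value on success; none of this is MediaWiki functionality.

```python
from concurrent.futures import ThreadPoolExecutor


def shard(titles, machine_index, machine_count):
    """Return the titles this machine should request.

    Machine k of n takes titles k, k+n, k+2n, ... so the slices never overlap.
    """
    return titles[machine_index::machine_count]


def warm_parallel(titles, fetch_fn, workers=4):
    """Call fetch_fn(title) from `workers` threads.

    Returns the titles for which fetch_fn raised or returned a false value.
    """
    def attempt(title):
        try:
            return bool(fetch_fn(title))
        except Exception:
            return False  # treat any error as a failed request

    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(attempt, titles))  # map preserves input order
    return [t for t, ok in zip(titles, results) if not ok]
```

Since all machines would share one database, the worker count probably needs tuning by experiment; too many simultaneous clients may just move the bottleneck to MySQL.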
Thanks in advance for your thoughts.
Brent
Widernet.org
MediaWiki 1.14.0
PHP 5.2.8 (apache2handler)
MySQL 5.0.41-community-nt-log
Apache 2.1
OS: various
mediawiki-l@lists.wikimedia.org