---------- Forwarded message ----------
From: Brent Palmer <bop@brentopalmer.com>
Date: 2010/2/2
Subject: [Mediawiki-l] Wikipedia mirror speedup
To: mediawiki-l@lists.wikimedia.org

Hi,
We are creating an off-line English Wikipedia mirror. By off-line, I
mean only that it is not reachable from the Internet; it is served on
a LAN that is not connected to the Internet. This will be deployed in
locations where there is little or no Internet access (schools and
universities in Africa, for example). More info:
http://en.wikipedia.org/wiki/EGranary_Digital_Library
The machines hosting the mirror are often lower-end machines without
a lot of spare memory, and we are tight on disk space. Generally
there won't be a lot of traffic, though. Also, the Wikipedia mirror
is read-only.

The main problem is that it is incredibly slow (even on our
relatively fast server). Articles like Abraham_Lincoln can take
several minutes to load. The longer the article and the more
templates it uses, the longer it takes. I know that this is a common
problem that people ask about for MediaWiki installations, and that
Wikipedia uses various levels of caching. Because of the above
constraints, Squid and the file cache don't really seem like viable
options. We tried PHP accelerators without much benefit. Looking at
the profiler log, it seems that most of the time is spent in the
parser. Pages speed up considerably after the initial access when we
have the parser cache set to CACHE_DB.
The plan:
We want to deliver the Wikipedia in a form that makes it as fast as
possible for the end user. The plan is to try and pre-cache all the
articles. In other words, set the parser cache not to expire at all
and try to hit all the articles one time to set the parser cache
before we deploy the mirror. The assumption is that this will
ultimately be less disk space than just creating a static copy of
all the pages or even just creating a huge file cache (I think there
is a management script that allows you to do this). This will also
solve the problem of generating all the thumbnails ahead of time so
that we don't have to do it on the fly. (We currently rewrite all
requests for thumbnails to the original image and let the browser
resize it--this also makes the mirror appear to run slowly.)
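
A minimal sketch of the warm-up pass described above, in Python. It
assumes a list of article titles is already available (for example,
read from a text file, one title per line) and that the mirror serves
articles under a base URL; BASE_URL, article_url, fetch and warm_cache
are all hypothetical names for illustration, not part of MediaWiki.
It only requests each page once and reports the titles that failed.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import quote
from urllib.request import urlopen

# Hypothetical base URL; adjust to the article path of the mirror.
BASE_URL = "http://localhost/wiki/"


def article_url(title, base_url=BASE_URL):
    """Build a page URL, replacing spaces with underscores in the title."""
    return base_url + quote(title.strip().replace(" ", "_"), safe="/:")


def fetch(url, timeout=600):
    """Request one page and return the HTTP status.

    The timeout is generous because an uncached article can take
    minutes to parse.
    """
    with urlopen(url, timeout=timeout) as response:
        response.read()  # read the whole body before closing
        return response.status


def warm_cache(titles, fetch_page=fetch, workers=4):
    """Request every non-empty title once; return titles that failed."""
    titles = [t for t in titles if t.strip()]

    def visit(title):
        try:
            return title, fetch_page(article_url(title)) == 200
        except OSError:  # covers URLError, HTTPError and socket timeouts
            return title, False

    failed = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for title, ok in pool.map(visit, titles):
            if not ok:
                failed.append(title)
    return failed
```

The returned list can be fed back into warm_cache for a retry pass.
The number of workers is a guess; it would need tuning against what
the database and web server can sustain.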
Any comments?
Are our assumptions correct? Is there another way to go about this?
Are there options we haven't thought of? Any comments on how to go
about it?
Thanks to donations, we have all the machinery, storage, time, and
processing power to do any pre-processing of Wikipedia assets, so
we're looking for ideas for anything we can do in advance to make it
run fast on end-user machines. Generally we plan on setting up
multiple machines with the Wikipedia running on each one, all
accessing one database and running multiple clients requesting
articles simultaneously. I don't know if it will be feasible to have
all the machines share the same filespace for thumbnail generation
(NFS/SMB) but I might try that too.
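
One way the work might be divided among several machines, sketched in
Python: each machine takes the titles whose hash falls in its own
shard, so no article is requested twice and no coordination is needed
beyond agreeing on the shard count. The function names here are
hypothetical. CRC32 is used instead of Python's built-in hash()
because the latter is randomized per process for strings.

```python
import zlib


def shard_for(title, shard_count):
    """Map a title to a shard index, stable across machines and runs."""
    return zlib.crc32(title.encode("utf-8")) % shard_count


def titles_for_shard(titles, shard_index, shard_count):
    """Return the titles that the machine with this index should request."""
    return [t for t in titles if shard_for(t, shard_count) == shard_index]
```

With three machines, for example, machine 0 would process
titles_for_shard(all_titles, 0, 3), machine 1 would use index 1, and
so on. This does not balance by article length, so shards with many
long, template-heavy articles will finish later.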
Thanks in advance for your thoughts.
Brent
Widernet.org
MediaWiki 1.14.0
PHP 5.2.8 (apache2handler)
MySQL 5.0.41-community-nt-log
Apache 2.1
OS: various
_______________________________________________
MediaWiki-l mailing list
MediaWiki-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/mediawiki-l