[Mediawiki-l] Wikipedia mirror speedup

Brent Palmer bop at brentopalmer.com
Tue Feb 2 14:29:29 UTC 2010


      Hi,

    We are creating an off-line English Wikipedia mirror. By off-line, I
    mean only that it is not reachable from the Internet, just from a
    LAN that is not connected to the Internet. This will be deployed in
    locations where there is little or no Internet access (schools and
    universities in Africa for example). More info:
    http://en.wikipedia.org/wiki/EGranary_Digital_Library

    The machines hosting the mirror are often lower-end machines without
    a lot of spare memory and we are tight on disk space. Generally
    there won't be a lot of traffic though. Also, the Wikipedia mirror
    is read-only. The main problem is that it is incredibly slow (even
    on our relatively fast server). Articles like Abraham_Lincoln can
    take several minutes. The longer the article and the more templates
    it uses, the longer it takes. I know that this is a common problem
    for MediaWiki installations and that Wikipedia itself relies on
    several levels of caching. Because of the above constraints, Squid
    and the file cache don't really seem like viable options. We
    tried PHP accelerators without much benefit. Looking at the
    profiler log, it seems that most of the time is spent in the
    parser. Pages speed up considerably after the initial access when we
    have the parser cache set to CACHE_DB.
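
    The cold-versus-warm difference described above can be measured
    with a small script. This is only a sketch: the URL in the usage
    comment is a hypothetical LAN address, and the function names are
    made up for illustration. It requests the same page twice and
    reports both timings; the first request is the one expected to
    populate the parser cache.

```python
import time
import urllib.request


def time_fetch(url, fetch=None):
    """Fetch url once and return the elapsed wall-clock seconds."""
    if fetch is None:
        # Long timeout, since uncached articles can take minutes to render.
        fetch = lambda u: urllib.request.urlopen(u, timeout=600).read()
    start = time.perf_counter()
    fetch(url)
    return time.perf_counter() - start


def cold_vs_warm(url, fetch=None):
    """Request the same page twice and return (first_time, second_time)."""
    cold = time_fetch(url, fetch)
    warm = time_fetch(url, fetch)
    return cold, warm


# Usage (hypothetical mirror address):
#   cold, warm = cold_vs_warm("http://mirror.lan/wiki/Abraham_Lincoln")
#   print("first: %.1fs, second: %.1fs" % (cold, warm))
```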

    The plan:
    We want to deliver the Wikipedia in a form that makes it as fast as
    possible for the end user. The plan is to try to pre-cache all the
    articles. In other words, set the parser cache not to expire at all
    and try to hit all the articles one time to set the parser cache
    before we deploy the mirror. The assumption is that this will
    ultimately be less disk space than just creating a static copy of
    all the pages or even just creating a huge file cache (I think there
    is a maintenance script that allows you to do this). This will also
    solve the problem of generating all the thumbnails ahead of time so
    that we don't have to do it on the fly. (We currently rewrite all
    thumbnail requests to the original images and let the browser resize
    them, which also makes the mirror appear to run slowly.)
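
    A minimal sketch of the "hit every article once" pass, assuming
    the titles are available as a plain list (one per entry) and that
    articles are served under a /wiki/ style path. The base URL and
    the function names are hypothetical, and nothing here is specific
    to MediaWiki beyond replacing spaces with underscores in titles.

```python
import urllib.parse
import urllib.request


def article_url(base_url, title):
    """Build the URL for one article (spaces become underscores)."""
    name = title.strip().replace(" ", "_")
    return base_url.rstrip("/") + "/" + urllib.parse.quote(name, safe="/:")


def warm_cache(titles, base_url, fetch=None):
    """Request every title once; return a list of (title, error) failures."""
    if fetch is None:
        # Long timeout, since uncached articles can take minutes to render.
        fetch = lambda url: urllib.request.urlopen(url, timeout=600).read()
    failures = []
    for title in titles:
        if not title.strip():
            continue  # skip blank entries in the title list
        try:
            fetch(article_url(base_url, title))
        except Exception as exc:
            # Record the failure and keep going so one bad page
            # does not stop the whole run.
            failures.append((title, str(exc)))
    return failures


# Usage (hypothetical file name and mirror address):
#   with open("all-titles.txt", encoding="utf-8") as f:
#       failed = warm_cache(f, "http://mirror.lan/wiki")
```

    Returning the failures instead of raising lets the run be repeated
    later for just the pages that did not render.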

    Any comments?
    Are our assumptions correct? Is there another way to go about this?
    Are there options we haven't thought of? Any comments on how to go
    about it?

    Thanks to donations, we have all the machinery, storage, time, and
    processing power to do any pre-processing of Wikipedia assets --
    we're seeking any ideas of things we can do in advance to make it
    run fast on end-user machines. Our general plan is to set up
    multiple machines with the Wikipedia running on each one, all
    accessing one database, and to run multiple clients requesting
    articles simultaneously. I don't know if it will be feasible to have
    all the machines share the same filespace for thumbnail generation
    (NFS/SMB) but I might try that too.
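
    The multi-machine, multi-client setup could be driven from one
    script along these lines. This is a sketch under assumptions: the
    host URLs, the worker count, and all function names are
    hypothetical, and titles are simply handed out to the hosts in
    round-robin order.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import cycle
import urllib.parse
import urllib.request


def assign_hosts(titles, hosts):
    """Pair each title with a host, cycling through the hosts in order."""
    return list(zip(cycle(hosts), titles))


def fetch_article(host, title):
    """Request one article from one host and discard the response body."""
    name = urllib.parse.quote(title.strip().replace(" ", "_"), safe="/:")
    url = host.rstrip("/") + "/" + name
    with urllib.request.urlopen(url, timeout=600) as resp:
        resp.read()


def warm_parallel(titles, hosts, workers=4, fetch=fetch_article):
    """Fetch all titles concurrently; map each title to None or an error."""
    def run(pair):
        host, title = pair
        try:
            fetch(host, title)
            return title, None
        except Exception as exc:
            return title, str(exc)

    results = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for title, err in pool.map(run, assign_hosts(titles, hosts)):
            results[title] = err
    return results


# Usage (hypothetical host addresses):
#   hosts = ["http://render1.lan/wiki", "http://render2.lan/wiki"]
#   status = warm_parallel(["Abraham Lincoln", "Africa"], hosts, workers=8)
```

    Threads are enough here because each client spends nearly all of
    its time waiting on the server; the worker count would need tuning
    against what the shared database can handle.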

    Thanks in advance for your thoughts.
    Brent
    Widernet.org

    MediaWiki 1.14.0
    PHP 5.2.8 (apache2handler)
    MySQL 5.0.41-community-nt-log
    Apache 2.1
    OS: various 



