Server-side File Caching Using a 404 Handler - Wikitech-l

4 Dec 2003


We have a good server-side file caching system already. I'm proposing
another one.

It's almost always true with Web servers in general and Apache in
particular that serving static files from the file system is going to
be much, much faster -- usually an order of magnitude faster -- than
doing any dynamic processing. In other words, any PHP page is going to
be significantly slower than a plain HTML page.

For this reason, I'm proposing using a directory in a MediaWiki
installation as a static file cache. Here's how it would work:
* There's a Web directory where static pages go -- say, "/var/www/wiki".
* URLs for articles point to that directory: the link for
  "Foo Bar" renders to "/wiki/Foo_Bar".
* Initially, an HTTP hit for "/wiki/Foo_Bar" will fail -- no page
  there.
* Apache has a directive -- ErrorDocument -- for mapping a script to
  handle missing files (among other errors). We map a script --
  /w/404Handler.phtml or something -- to handle these errors for the
  "/wiki/" directory.
    <Directory /var/www/wiki/>
      ErrorDocument 404 /w/404Handler.phtml
    </Directory>
* The 404 handler tries to retrieve the article from the database. If
  the article doesn't exist, it shows an edit form for the article,
  just like a broken link works now. If the article is Special:, it
  lets wiki.phtml do the work.
* If the article _does_ exist, the handler renders the page *to a file
  in the cache directory*, e.g. "/wiki/Foo_Bar.html". It then opens
  the file and serves it to the current user, too. (There are probably
  other ways to do this, but as far as I can see they would require
  two hits to the server, which is wasteful. It's easier and cheaper
  just to read the file that was written.)
  (Of course, the handler has to set an HTTP 200 status itself on
  success; otherwise the response would go out with the original 404
  status.)
* For the next hit to come in for "/wiki/Foo_Bar", the Web server will
  find the file there (MultiViews will find the ".html" version -- a
  future 404 handler might also write out ".xml" or other document
  file formats), and thus serve it directly, without running any PHP
  code.
* On saving an edit of a page, the software simply *deletes* the
  cached HTML file. This will trigger the 404 handler on the next hit
  for the page, which will regenerate the cached file.
  Similar cache invalidation would be necessary for moving or deleting
  a page.
* If an edit would change the appearance of links to a page -- say,
  creating a new article (broken link becomes good link), deleting an
  article (good link becomes broken link), or crossing the default
  stub boundary (if it exists) -- saving the page would also delete
  the cached versions of any pages that link to the changed page.
  These, too, will be regenerated by the 404 handler on the next hit
  (with the new link appearance).
* A garbage-collection cron job runs every N minutes to keep the size
  of the cache to a reasonable level (# bytes, # files). It deletes
  the least recently used pages, by filesystem access time.
  (There's a possible race condition where a huge influx of activity
  could overflow the cache between garbage collection runs. If the
  installation is susceptible to this -- say, it has hard disk space
  limits -- garbage collection could happen in the 404 handler rather
  than in a cron job. This would obviously slow display of pages,
  though.)
* Of course, logged-in users want to get their pages rendered
  _just_so_, with question marks and [edit] links, etc. They should
  also have all their "My page" and other links working.
  It may be possible to do most of this with JavaScript -- checking
  the UserId cookie, and showing and hiding parts based on that. Then
  the same .html file can be served to logged-in and not-logged-in
  users, with the client side doing the customization.
  But probably the easiest way is just to run the dynamic pages every
  time for logged-in users. We can use the Apache rewrite engine to
  serve different stuff, based on whether you're logged in or not. We
  use a RewriteCond line to check if the UserId cookie exists:
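    # Send requests that carry a UserId cookie to the dynamic script.
    # "ampescape" is a RewriteMap that has to be defined elsewhere.
    RewriteCond %{HTTP_COOKIE} UserId=
    RewriteRule ^/wiki/(.*)$ /wiki.phtml?title=${ampescape:$1} [L]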
  This only pays off if a significant share of users are not logged
  in. If 100% of users are logged in, we get no benefit, plus a
  slight cost from the rewrite engine checking the cookie on every
  hit.
* A more aggressive version could try to cache some of the other
  features of MediaWiki, like user contributions, page histories,
  etc., with the same strategy.
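
Here's a rough sketch of the miss-then-cache flow and the
delete-on-save invalidation described above. It's in Python rather
than PHP just to keep it short and runnable, and it is not actual
MediaWiki code: ARTICLES, LINKS_TO, render(), cache_path(),
handle_404() and invalidate() are all made-up names standing in for
the database, the link table and the renderer. Writing to a temporary
file and renaming it into place is my own addition, so that the Web
server never sees a half-written page.

```python
import os
import tempfile
from pathlib import Path

# Hypothetical stand-ins for the article table and the link table.
ARTICLES = {"Foo_Bar": "Text of Foo Bar."}
LINKS_TO = {"Foo_Bar": {"Main_Page"}}  # title -> pages that link to it


def cache_path(cache_dir: Path, title: str) -> Path:
    """Static file that the Web server would serve for this title."""
    return cache_dir / (title + ".html")


def render(title: str) -> str:
    """Placeholder for the expensive dynamic rendering."""
    return "<html><body><h1>%s</h1><p>%s</p></body></html>" % (
        title, ARTICLES[title])


def handle_404(cache_dir: Path, title: str) -> str:
    """Runs only when no static file exists for the requested title."""
    if title.startswith("Special:"):
        # Special pages are never cached; the dynamic script handles them.
        return "(delegated to the dynamic script)"
    if title not in ARTICLES:
        # Missing article: show an edit form and cache nothing.
        return "<html><body>edit form for %s</body></html>" % title

    html = render(title)
    # Assumption: write to a temp file in the same directory, then rename,
    # so a concurrent request never reads a partially written file.
    fd, tmp_name = tempfile.mkstemp(dir=str(cache_dir), suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as tmp:
        tmp.write(html)
    os.replace(tmp_name, str(cache_path(cache_dir, title)))
    # Serve what was just written (a real handler would also send a 200).
    return cache_path(cache_dir, title).read_text(encoding="utf-8")


def invalidate(cache_dir: Path, title: str,
               link_appearance_changed: bool = False) -> None:
    """Called on save, move or delete: drop the cached file(s)."""
    targets = {title}
    if link_appearance_changed:
        # Pages linking here would show the link differently now.
        targets |= LINKS_TO.get(title, set())
    for target in targets:
        try:
            cache_path(cache_dir, target).unlink()
        except FileNotFoundError:
            pass  # was never cached, or already removed
```

The first call to handle_404(cache_dir, "Foo_Bar") leaves
Foo_Bar.html in the cache directory; after that the Web server would
serve the file without calling the handler, until invalidate()
removes it again. Titles containing "/" aren't handled here.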
Some advantages with this design:
* It reduces the amount of dynamic page serving.
* Apache handles all the tweaky optimizations like compression,
  content negotiation, client-side caching, etc. It simplifies the
  MediaWiki code.
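
Going back to the garbage-collection job in the list of steps: a
minimal sketch of "delete the least recently used pages, by
filesystem access time", again in Python. collect_garbage() and its
max_bytes / max_files parameters are names I made up for
illustration. One caveat: this only works if the file system actually
records access times; a volume mounted with noatime would need some
other signal.

```python
from pathlib import Path


def collect_garbage(cache_dir: Path, max_bytes: int, max_files: int) -> list:
    """Delete least recently accessed files until both limits are met.

    Returns the names of the files that were removed, oldest first.
    """
    entries = []
    for path in cache_dir.iterdir():
        if path.is_file():
            st = path.stat()
            entries.append((st.st_atime, st.st_size, path))
    entries.sort(key=lambda entry: entry[0])  # oldest access time first

    total_bytes = sum(size for _, size, _ in entries)
    total_files = len(entries)
    removed = []
    for _, size, path in entries:
        if total_bytes <= max_bytes and total_files <= max_files:
            break  # the cache is back within both limits
        try:
            path.unlink()
        except FileNotFoundError:
            pass  # an edit may have invalidated the page in the meantime
        total_bytes -= size
        total_files -= 1
        removed.append(path.name)
    return removed
```

A cron job would call this every N minutes; calling it from the 404
handler instead is the slower but safer variant mentioned above.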
Some disadvantages I see with this design:
* There'd probably be some problems with funky characters -- spaces,
  punctuation, etc. "/wiki/Foo Bar" and "/wiki/Foo_Bar" should
  probably (usually) map to the same file. Some work needed here.
* You lose view counting. However, a periodic Web server log checker
  could provide the same functionality -- reading the Web log,
  updating the database with new view counts.
* Directory entries. Having a few thousand files in a single directory
  could get slow on some file systems. There may be some rewrite
  tricks that would let us write into multiple sub-directories, like
  we do with images in the /upload dir right now, while still having
  the pages appear to be in the /wiki directory.
* Redirect articles are tricky.
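
For the first and third points, here's one way the mapping could
look, sketched in Python. canonical_title() and cache_file() are
hypothetical names, and the hash-prefix directory layout is just an
assumption about what a sub-directory scheme might look like, not a
description of what the /upload code does.

```python
import hashlib
from urllib.parse import unquote

PREFIX = "/wiki/"


def canonical_title(url_path: str) -> str:
    """Map '/wiki/Foo Bar', '/wiki/Foo%20Bar' and '/wiki/Foo_Bar'
    to the same title, 'Foo_Bar'."""
    if not url_path.startswith(PREFIX):
        raise ValueError("not an article URL: %r" % url_path)
    title = unquote(url_path[len(PREFIX):])
    return title.replace(" ", "_")


def cache_file(title: str) -> str:
    """Relative cache path that spreads files over sub-directories.

    Assumed layout: <first hex digit>/<first two hex digits>/<title>.html,
    with the digits taken from an MD5 hash of the title.
    """
    digest = hashlib.md5(title.encode("utf-8")).hexdigest()
    return "%s/%s/%s.html" % (digest[0], digest[:2], title)
```

With something like this, all three spellings of the URL end up at
one cached file, and no single directory has to hold every page. The
rewrite rules would still have to translate "/wiki/Foo_Bar" into the
hashed path, which this sketch doesn't cover.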
Anyways, I've been thinking about this for a while, thought I'd throw
it out here to get responses. I'll probably rewrite this and put it on
meta soon.
~ESP
-- 
Evan Prodromou evan@wikitravel.org
Wikitravel - http://www.wikitravel.org/
The free, complete, up-to-date and reliable world-wide travel guide