The last batch of fun upgrades included use of gzip compression on cached pages where browsers accept it. I'm happy to report that this seems to have decreased the bandwidth usage of the English-language Wikipedia by up to roughly 25%.
Data from http://www.wikipedia.org/stats/usage_200306.html
            kilobytes /   hits  = kb/hit
2003-06-01    6392951 / 619534 = 10.319 \
2003-06-02    7908928 / 793065 =  9.973 |
2003-06-03    8267879 / 822025 = 10.058 |  mean 10.1
2003-06-04    7513917 / 755482 =  9.946 |  range 0.37
2003-06-05    7347843 / 723717 = 10.153 |
2003-06-06    6300476 / 614552 = 10.252 |
2003-06-07    5159151 / 503097 = 10.255 |
2003-06-08    5732741 / 566484 = 10.120 /
-- gzip cache activated: --
2003-06-09    5376987 / 726971 =  7.396 \
2003-06-10    5442685 / 732897 =  7.426 |  mean 7.6
2003-06-11    5735325 / 765204 =  7.495 |  range 0.85
2003-06-12    6362049 / 772002 =  8.241 /
These counts include pages, images, css, everything. (But not the other languages, mailing list, or database dump downloads.)
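In case it helps to picture the mechanism, here's a rough sketch of the idea (a simplified illustration only, not the actual wiki code; the cache path and function name are invented). The gzipped copy is sent only when the browser's Accept-Encoding header says it can handle gzip; otherwise the plain copy goes out:

<?php
// Illustrative sketch only -- paths and function name are hypothetical.
// Serve a cached page, gzipped when (and only when) the browser accepts it.
function sendCachedPage( $title ) {
    $base = '/var/cache/wiki/' . md5( $title );   // assumed cache layout
    $acceptsGzip = isset( $_SERVER['HTTP_ACCEPT_ENCODING'] )
        && strpos( $_SERVER['HTTP_ACCEPT_ENCODING'], 'gzip' ) !== false;

    header( 'Content-Type: text/html' );
    header( 'Vary: Accept-Encoding' );            // so proxies keep the two variants apart

    if ( $acceptsGzip && file_exists( $base . '.gz' ) ) {
        header( 'Content-Encoding: gzip' );
        readfile( $base . '.gz' );                // pre-compressed copy from the cache
    } else {
        readfile( $base . '.html' );              // plain copy as the fallback
    }
}
?>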
The bandwidth usage did go up a bit today, so it remains to be seen just how stable the effect is. A number of things can affect it:
* Since caching, and thus sending of gzipped cached pages, is currently only done for anonymous users, an increase in activity by registered users would tend to reduce the overall percentage of savings
* Lots of editing and loading of dynamic pages, which are not compressed, would do the same
* A large increase in brief visits by newbies drawn in by a link or news mention would increase the relative bandwidth used by linked items (logo, style sheet, etc.), which are not additionally compressed
* Lots of work with new images might increase bandwidth usage.
Other thoughts:
- So far, no one's complained about being unable to read pages due to pages being sent compressed when they shouldn't be. (There was in fact a bug to this effect which made the wiki unusable with Safari, but I don't know if anyone but me noticed before I fixed it. :)
- Since gzipping is done only at cache time, this should use very little CPU. IIRC the gzip was one of the faster steps when I was test profiling this ;) and the number of times gzipping is done should not generally exceed the number of edits * some factor regarding creation/deletion rates and number of links per page. (More of course when the cache has to be regenerated en masse.)
- The page cache presently stores both uncompressed and compressed copies of pages, which is space inefficient, though we're not hurting for space on larousse at the moment. Someone suggested storing just the compressed pages; then, in the relatively rare case that a browser won't accept gzipped pages, we can unzip them on the fly. (Both variants are sketched below, after these notes.)
- We could, either by default or as an option, compress dynamically generated pages as well, which could shave some more percentage points off the bandwidth usage. Might be a help for the modem folks who do log in. :) However I'm not sure how much this would affect CPU usage; in any case there's no urgency for this, it's just something we might do if we have the cycles to burn (I don't think we do just now, but we might one day).
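To put the cache-time compression and storage points above in concrete terms, here's a rough sketch (again illustrative only; the file layout and function names are invented, not the real code). Compression happens once, when the cache entry is written, so the CPU cost scales with edits rather than with page views; the space-saving variant keeps only the .gz file and unzips it on the fly for the rare browser that can't take gzip:

<?php
// Rough sketch only -- file layout and function names are invented.
// Compress once, at cache-write time.
function saveToCache( $title, $html ) {
    $base = '/var/cache/wiki/' . md5( $title );    // assumed cache layout

    $fp = fopen( $base . '.html', 'wb' );          // plain copy
    fwrite( $fp, $html );
    fclose( $fp );

    $zp = gzopen( $base . '.gz', 'wb9' );          // gzipped copy, max compression
    gzwrite( $zp, $html );
    gzclose( $zp );
}

// The space-saving variant: store only the .gz file, and for the rare
// browser that won't accept gzip, decompress on the fly while sending.
function sendPlainFromGzip( $base ) {
    readgzfile( $base . '.gz' );                   // unzips as it outputs
}
?>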
-- brion vibber (brion @ pobox.com)
On Thu, Jun 12, 2003 at 11:02:37PM -0700, Brion Vibber wrote:
> - We could, either by default or as an option, compress dynamically
> generated pages as well, which could shave some more percentage points off
> the bandwidth usage. Might be a help for the modem folks who do log in. :)
> However I'm not sure how much this would affect CPU usage; in any case
> there's no urgency for this, it's just something we might do if we have
> the cycles to burn (I don't think we do just now, but we might one day).
Modem people have hardware compression in their analog modems. ISDN people might benefit, since they normally don't have link compression.
JeLuF
--On Thursday, 12 June 2003, 23:02 -0700, Brion Vibber <brion@pobox.com> wrote:
> - We could, either by default or as an option, compress dynamically
> generated pages as well, which could shave some more percentage points off
> the bandwidth usage. Might be a help for the modem folks who do log in. :)
> However I'm not sure how much this would affect CPU usage; in any case
> there's no urgency for this, it's just something we might do if we have
> the cycles to burn (I don't think we do just now, but we might one day).
We do dynamic gzipping of pages on a rather large website (~3,000,000 dynamic hits daily). The experience we have gathered so far shows that the gzipping itself is actually rather fast compared to the page generation process through PHP/Perl. The main problem with dynamic gzipping is that you have to build up the whole page in memory instead of sending out lines as they are generated (I don't know how the Wikipedia software currently works). As a safeguard, we occasionally (about once a minute per Apache process) read /proc/loadavg on Linux systems. If it's higher than a specified limit (9.0 on our systems), we temporarily disable page gzipping.
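Roughly, the safeguard looks like this (a simplified sketch; the function name and where you call it are just an example, the 9.0 threshold and /proc/loadavg check are as described above):

<?php
// Sketch: enable output gzipping for this request only if the box isn't
// already overloaded, judging by the 1-minute load average.
function startGzipBufferingIfIdle( $maxLoad = 9.0 ) {
    $load = 0.0;
    if ( is_readable( '/proc/loadavg' ) ) {                    // Linux-specific
        $fields = explode( ' ', file_get_contents( '/proc/loadavg' ) );
        $load = (float)$fields[0];                             // 1-minute average
    }
    if ( $load < $maxLoad ) {
        ob_start( 'ob_gzhandler' );  // buffer and gzip the dynamically generated page
    } else {
        ob_start();                  // too busy: buffer without compression
    }
}
?>

(ob_gzhandler checks the browser's Accept-Encoding itself, so non-gzip browsers still get a plain page either way. In practice you would also cache the loadavg reading rather than hitting /proc on every request, as described above.)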
Some other optimization-related suggestions (I'm not familiar with what was already suggested, sorry):
- Drop Apache for image delivery. Instead, put a webserver like thttpd (http://www.acme.com/software/thttpd/) on a subdomain for image delivery. In our experience, the number of hits delivered relative to the memory and CPU time used is significantly better than with Apache.
- Consider implementing Squid as a front-end to your dynamic Apache. It's fairly fast to implement if your software delivers proper headers, and it gives you caching for anonymous users without extra code. Even for logged-in users it has a serious advantage: Apache no longer has to wait until it can send out all the data. Usually the load on the servers will increase as a result, since they sit idle less, waiting for traffic to be sent out, but at the same time more pages are delivered per second. Dynamic Apaches are also very sensitive to bad connections, because you usually have only a small, limited pool of processes (we run a maximum of 70 on our site, for example); if users' connections are generally bad, they can easily hog 99% of your Apache processes in the connection start-up state and thereby basically bring Wikipedia down, even though 95% of your server resources are not actually being used. A Squid front-end usually delivers increased performance even if you completely disable caching in Squid. A drawback, though, is the increased number of context switches per second and more memory-copy operations.
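On the "proper headers" point, the main thing is that the application tells Squid which responses are safe to cache. A rough sketch of what that might look like from PHP (the header values, the 600-second s-maxage, and the function name are just an example):

<?php
// Sketch: emit caching headers so a Squid front-end may cache anonymous
// page views but never caches personalised, logged-in views.
function sendCacheHeaders( $loggedIn, $lastModified ) {
    header( 'Last-Modified: ' . gmdate( 'D, d M Y H:i:s', $lastModified ) . ' GMT' );
    if ( $loggedIn ) {
        // Personalised output: tell the cache to keep its hands off.
        header( 'Cache-Control: private, must-revalidate, max-age=0' );
    } else {
        // Anonymous output: let Squid serve it for a while.
        header( 'Cache-Control: public, s-maxage=600, max-age=0' );
        header( 'Vary: Accept-Encoding, Cookie' );
    }
}
?>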