2006/7/7, Daniel Kinzler daniel@brightbyte.de:
I talked to Leon about ways to make hit counters feasible for all projects. The core points are:
- Just like Edward did, use JS code to trigger an HTTP request on page views. But this should be throttled to a probability of 1% - or, for large projects, 0.1%. This should still give us usable stats for the most popular pages.
The ruwiki TOP100 script shows about 300 hits for the last (#100) place, so a 1% sample would record only ~3 of them. It would be better to handle at least 5-10% of all requests.
There's another optimization on the client side: my counter filtered out any request for history pages, diffs, pages outside the article namespace etc. This should be added to the JS script (sorry, I can't do so right now because I have no sysop rights).
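Roughly, the check could look like the following Perl sketch (mirroring what my counter does now; the query-string parameter names and the 5% rate are only examples, and the same logic plus Math.random() sampling would go into the JS snippet):

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Decide whether a pageview request should be counted at all.
    # $query is an index.php query string, e.g. "title=Foo&action=history".
    sub should_count {
        my ($query) = @_;
        my %p = map { my ($k, $v) = split /=/, $_, 2; ($k, defined $v ? $v : '') }
                split /&/, $query;
        return 0 if defined $p{action} && $p{action} ne 'view';  # history, edit, ...
        return 0 if defined $p{diff} || defined $p{oldid};       # diffs, old revisions
        return 0 if !defined $p{title};
        return 0 if $p{title} =~ /:/;   # crude namespace check, skips "Prefix:Title"
        return 0 if rand() >= 0.05;     # keep ~5% of views (Math.random() on the JS side)
        return 1;
    }

    print should_count('title=Some_article') ? "count\n" : "skip\n";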
- Just like Edward, use a persistent server, not cgi/php. To avoid exposing home-brewed hacks to the wild web, we should stick to something tried and true. I suggested implementing it as a Java servlet. Should be fairly straightforward, and we have Tomcat running anyway.
Please see the source: http://tools.wikimedia.de/~edwardspec/src/wikicnt_daemon.pl This is written in Perl. Also, anything strange in the HTTP connection makes it drop the connection without an answer. The only potential security problem here is reading the request line with my $req = <$c>; (there is no check for long lines - this is not fatal for Perl, but it might take some memory).
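If that ever becomes a problem, the read could be capped; only a sketch, with $c being the accepted client socket as in the daemon:

    # Replacement sketch for the unbounded  my $req = <$c>;
    # Read at most $limit bytes of the first request line, so a client
    # cannot make the daemon buffer an arbitrarily long line.
    sub read_request_line {
        my ($c, $limit) = @_;
        $limit ||= 1024;
        my $req = '';
        while (length($req) < $limit) {
            my $n = sysread($c, my $buf, 1);
            last unless $n;            # 0 = EOF, undef = read error
            last if $buf eq "\n";      # end of the request line
            $req .= $buf;
        }
        $req =~ s/\r$//;               # drop the CR from "\r\n"
        return $req;
    }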
- To get around latency issues with the database, don't write to it on every hit (that causes more load on the already troubled DB); instead, cache updates in RAM for a minute or so, then flush them into the db in a single insert.
There's another problem: we need to save disk space too (it seems the default MediaWiki counter was disabled because it consumes too much space - 4*12000*60*60*24 = 4147200000 bytes = 3955 Mb each day).
I used UPDATE statements instead. Yes, this is worse (for example, INSERTs can be optimized by writing them into a text file and applying it with LOAD DATA LOCAL INFILE), but this way the database can't become larger than 6 Mb (for ruwiki with 900000 articles).
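As a sketch only (the hit_counter table and its columns are made up here, and page_title is assumed to be the unique key), a compromise between the two approaches could be to keep the one-minute RAM cache but flush it with a single multi-row INSERT ... ON DUPLICATE KEY UPDATE, so the table still stays at one row per page:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect('DBI:mysql:database=hitcount', 'user', 'password',
                           { RaiseError => 1 });
    my %hits;                    # page title -> hits since the last flush
    my $last_flush = time();

    sub count_hit {
        my ($title) = @_;
        $hits{$title}++;
        flush_hits() if time() - $last_flush >= 60;
    }

    sub flush_hits {
        return unless %hits;
        my @titles = keys %hits;
        # One statement per flush; existing rows are incremented,
        # so the table never grows beyond one row per counted page.
        my $sql = 'INSERT INTO hit_counter (page_title, hits) VALUES '
                . join(',', ('(?,?)') x @titles)
                . ' ON DUPLICATE KEY UPDATE hits = hits + VALUES(hits)';
        $dbh->do($sql, undef, map { ($_, $hits{$_}) } @titles);
        %hits = ();
        $last_flush = time();
    }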
- Edward used a lot of RAM for a name -> id mapping. This should be avoided - the name is unique, we don't need the page ID. If we want the ID, it should be determined on the Wikipedia server and supplied with the request - I talked to Tim Starling about making this and other useful things available as JS variables.
Storing IDs is much more efficient (they are smaller and always fixed-size). But actually this was requested by ruwiki users later, in order to preserve the counter value after _renaming_ an article. With titles as keys, it would be lost.
Now this is not a problem: the small copy of the database (title as key, id as value) was moved into a GDBM file, and the in-memory cache is now disabled. The database copy is updated every day at 00:00 (this takes 5-7 seconds) and takes 14 Mb of disk space.
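For reference, rebuilding such a GDBM copy from the MediaWiki page table is only a few lines of Perl (the file name, credentials and query below are just an illustration, not the actual cron job):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use DBI;
    use GDBM_File;

    # Nightly rebuild of the title -> page_id copy.
    my $dbh = DBI->connect('DBI:mysql:database=ruwiki', 'user', 'password',
                           { RaiseError => 1 });
    tie my %title2id, 'GDBM_File', 'title2id.gdbm', &GDBM_WRCREAT, 0640
        or die "cannot open GDBM file: $!";

    my $sth = $dbh->prepare(
        'SELECT page_title, page_id FROM page WHERE page_namespace = 0');
    $sth->execute();
    while (my ($title, $id) = $sth->fetchrow_array()) {
        $title2id{$title} = $id;   # later lookups hit the disk file, not RAM
    }

    untie %title2id;
    $dbh->disconnect();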