[Toolserver-l] fwd: wikicnt_daemon.pl
Edward Chernenko
edwardspec at gmail.com
Sat Jul 8 03:08:33 UTC 2006
2006/7/7, Daniel Kinzler <daniel at brightbyte.de>:
>
> I talked to Leon about ways to make hit counters feasible, for all
> projects. The core points are:
>
> * Just like Edward did, use JS code to trigger an HTTP request on page
> views. But this should be throttled to a probability of 1% - or, for
> large projects, 0.1%. This should still give us usable stats for the
> most popular pages.
The ruwiki TOP100 script shows about 300 hits for the last (#100) place,
so it's better to handle at least 5-10% of all requests.
There's another client-side optimization: my counter filtered out any
request for history pages, diffs, pages outside the article namespace,
etc. This should be added to the JS script as well (sorry, I can't do it
myself right now because I have no sysop rights). A rough sketch follows.
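Roughly, the filter would be something like this - shown here in Perl as
a fallback check inside the daemon; the parameter names are only my guess
at what the JS would send:

  # Fallback filter in the daemon; assumes the JS passes the page title,
  # the namespace number and the action as query parameters.
  sub should_count {
      my ($title, $ns, $action) = @_;
      return 0 unless defined $title && length $title;
      return 0 if defined $action && $action ne 'view';  # skip diffs, history, edits
      return 0 if defined $ns && $ns != 0;               # article namespace only
      return 1;
  }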
>
> * Just like Edward, use a persistent server, not cgi/php. To avoid
> exposing home-brewed hacks to the wild web, we should stick to something
> tried and true. I suggested implementing it as a Java servlet. Should be
> fairly straightforward, and we have Tomcat running anyway.
Please see the source:
http://tools.wikimedia.de/~edwardspec/src/wikicnt_daemon.pl
It is written in Perl. Also, anything strange in the HTTP connection
results in the connection being dropped without an answer. The only
potential security problem here is reading the request line with
my $req = <$c>;
(there is no check for long lines - this is not fatal for Perl, but it
might consume some memory).
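If that ever becomes an issue, the unbounded readline could be replaced
by a bounded read; a quick sketch, assuming $c is the client socket as in
the daemon:

  # Read at most 1024 bytes instead of an unbounded <$c>, then keep only
  # the first line as the request line. A real version would loop until
  # it sees the newline or hits the limit.
  my $buf = '';
  sysread($c, $buf, 1024);
  my ($req) = split /\r?\n/, $buf, 2;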
>
> * To get around latency issues with the database, don't spawn (causing
> more load on the already troubled DB); instead, cache updates in RAM for
> a minute or so, then flush them into the DB in a single insert.
There's another problem: we need to save disk space too (it seems the
default MediaWiki counter was disabled because it consumed too much
space - 4*12000*60*60*24 = 4147200000 bytes = 3955 MB each day).
I used UPDATE statements instead. Yes, this is worse (for example,
INSERT can be optimized by writing into a text file and applying it with
LOAD DATA LOCAL INFILE), but the database can't grow larger than 6 MB
(for ruwiki with 900000 articles). A sketch of the flush is below.
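For completeness, a batched once-a-minute flush with UPDATE statements
could look roughly like this - a sketch only, not the daemon's actual
code; the table and column names are made up:

  use DBI;
  my %hits;   # page_id => hits accumulated since the last flush
  sub flush_hits {
      my ($dbh) = @_;   # an already-connected DBI handle
      my $sth = $dbh->prepare(
          'UPDATE hit_counter SET hits = hits + ? WHERE page_id = ?');
      $sth->execute($hits{$_}, $_) for keys %hits;
      %hits = ();
  }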
>
> * Edward used a lot of RAM for a name -> id mapping. This should be
> avoided - the name is unique, we don't need the page ID. If we want the
> ID, it should be determined on the wikipedia server and supplied with
> the request - I talked to Tim Starling about making this and other
> useful things available as JS variables.
It's much more efficient to store IDs (they are smaller and always
fixed-size). But actually this was requested by ruwiki users later, in
order to preserve the counter value after an article is _renamed_; with
titles, that value would be lost.
Now this is not a problem: a small copy of the database (title as key,
id as value) was moved into a GDBM file and the in-memory cache is now
disabled. The database copy is updated every day at 00:00 (this takes
5-7 seconds) and takes 14 MB of disk space.
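The lookup itself is then trivial with GDBM_File; a sketch (the file name
here is made up, and the key is the page title exactly as the client
sends it):

  use GDBM_File;
  # Open the nightly title -> page_id copy read-only.
  tie my %title2id, 'GDBM_File', 'title2id.gdbm', &GDBM_READER, 0644
      or die "cannot open GDBM file: $!";
  my $title   = 'Main_Page';         # e.g. the title taken from the request
  my $page_id = $title2id{$title};   # undef if the title is unknown
  untie %title2id;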
--
Edward Chernenko <edwardspec at gmail.com>