2006/7/7, Daniel Kinzler daniel@brightbyte.de:
I talked to Leon about ways to make hit counters feasible for all projects. The core points are:
- Just like Edward did, use JS code to trigger an HTTP request on page views. But this should be throttled to a probability of 1% - or, for large projects, 0.1%. This should still give us usable stats for the most popular pages.
The ruwiki TOP100 script shows about 300 hits for the last (#100) place, so a 1% sample would record only ~3 of them. It would be better to handle at least 5-10% of all requests.
There's another optimization on the client side: my counter filtered out any request for history pages, diffs, pages outside the article namespace etc. This should be added to the JS script (sorry, I can't do so right now because I have no sysop rights).
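Roughly, the check could look like the following Perl sketch (mirroring what my counter does now; the query-string parameter names and the 5% rate are only examples, and the same logic plus Math.random() sampling would go into the JS snippet):

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Decide whether a pageview request should be counted at all.
    # $query is an index.php query string, e.g. "title=Foo&action=history".
    sub should_count {
        my ($query) = @_;
        my %p = map { my ($k, $v) = split /=/, $_, 2; ($k, defined $v ? $v : '') }
                split /&/, $query;
        return 0 if defined $p{action} && $p{action} ne 'view';  # history, edit, ...
        return 0 if defined $p{diff} || defined $p{oldid};       # diffs, old revisions
        return 0 if !defined $p{title};
        return 0 if $p{title} =~ /:/;   # crude namespace check, skips "Prefix:Title"
        return 0 if rand() >= 0.05;     # keep ~5% of views (Math.random() on the JS side)
        return 1;
    }

    print should_count('title=Some_article') ? "count\n" : "skip\n";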
- Just like Edward, use a persistent server, not cgi/php. To avoid exposing home-brewed hacks to the wild web, we should stick to something tried and true. I suggested implementing it as a Java servlet. Should be fairly straightforward, and we have Tomcat running anyway.
Please see the source: http://tools.wikimedia.de/~edwardspec/src/wikicnt_daemon.pl This is written in Perl. Also, anything strange in the HTTP connection makes it drop the connection without an answer. The only potential security problem here is reading the request line with my $req = <$c>; (there is no check for long lines - this is not fatal for Perl, but it might take some memory).
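If that ever becomes a problem, the read could be capped; only a sketch, with $c being the accepted client socket as in the daemon:

    # Replacement sketch for the unbounded  my $req = <$c>;
    # Read at most $limit bytes of the first request line, so a client
    # cannot make the daemon buffer an arbitrarily long line.
    sub read_request_line {
        my ($c, $limit) = @_;
        $limit ||= 1024;
        my $req = '';
        while (length($req) < $limit) {
            my $n = sysread($c, my $buf, 1);
            last unless $n;            # 0 = EOF, undef = read error
            last if $buf eq "\n";      # end of the request line
            $req .= $buf;
        }
        $req =~ s/\r$//;               # drop the CR from "\r\n"
        return $req;
    }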
- To get around latency issues with the database, don't write to it on every hit (that causes more load on the already troubled DB); instead, cache updates in RAM for a minute or so, then flush them into the db in a single insert.
There's another problem: we need to save disk space too (it seems the default MediaWiki counter was disabled because it consumes too much space - 4*12000*60*60*24 = 4147200000 bytes = 3955 Mb each day).
I used UPDATE statements instead. Yes, this is worse (for example, INSERTs can be optimized by writing them into a text file and applying it with LOAD DATA LOCAL INFILE), but this way the database can't become larger than 6 Mb (for ruwiki with 900000 articles).
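As a sketch only (the hit_counter table and its columns are made up here, and page_title is assumed to be the unique key), a compromise between the two approaches could be to keep the one-minute RAM cache but flush it with a single multi-row INSERT ... ON DUPLICATE KEY UPDATE, so the table still stays at one row per page:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect('DBI:mysql:database=hitcount', 'user', 'password',
                           { RaiseError => 1 });
    my %hits;                    # page title -> hits since the last flush
    my $last_flush = time();

    sub count_hit {
        my ($title) = @_;
        $hits{$title}++;
        flush_hits() if time() - $last_flush >= 60;
    }

    sub flush_hits {
        return unless %hits;
        my @titles = keys %hits;
        # One statement per flush; existing rows are incremented,
        # so the table never grows beyond one row per counted page.
        my $sql = 'INSERT INTO hit_counter (page_title, hits) VALUES '
                . join(',', ('(?,?)') x @titles)
                . ' ON DUPLICATE KEY UPDATE hits = hits + VALUES(hits)';
        $dbh->do($sql, undef, map { ($_, $hits{$_}) } @titles);
        %hits = ();
        $last_flush = time();
    }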
- Edward used a lot of RAM for a name -> id mapping. This should be avoided - the name is unique, we don't need the page ID. If we want the ID, it should be determined on the Wikipedia server and supplied with the request - I talked to Tim Starling about making this and other useful things available as JS variables.
Storing IDs is much more efficient (they are smaller and always fixed-size). But actually this was requested by ruwiki users later, in order to preserve the counter value after _renaming_ an article. With titles as keys, it would be lost.
Now this is not a problem: the small copy of the database (title as key, id as value) was moved into a GDBM file, and the in-memory cache is now disabled. The database copy is updated every day at 00:00 (this takes 5-7 seconds) and takes 14 Mb of disk space.
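For reference, rebuilding such a GDBM copy from the MediaWiki page table is only a few lines of Perl (the file name, credentials and query below are just an illustration, not the actual cron job):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use DBI;
    use GDBM_File;

    # Nightly rebuild of the title -> page_id copy.
    my $dbh = DBI->connect('DBI:mysql:database=ruwiki', 'user', 'password',
                           { RaiseError => 1 });
    tie my %title2id, 'GDBM_File', 'title2id.gdbm', &GDBM_WRCREAT, 0640
        or die "cannot open GDBM file: $!";

    my $sth = $dbh->prepare(
        'SELECT page_title, page_id FROM page WHERE page_namespace = 0');
    $sth->execute();
    while (my ($title, $id) = $sth->fetchrow_array()) {
        $title2id{$title} = $id;   # later lookups hit the disk file, not RAM
    }

    untie %title2id;
    $dbh->disconnect();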