[Toolserver-l] fwd: wikicnt_daemon.pl
Daniel Kinzler
daniel at brightbyte.de
Fri Jul 7 12:23:19 UTC 2006
Edward Chernenko replied to my question about the wikicnt_daemon.pl
script. He CCed this list, but apparently that did not go through.
Maybe he's not registered here? He should be...
Anyway, below is his response, FYI:
-------- Original Message --------
Subject: Re: wikicnt_daemon.pl
Date: Fri, 7 Jul 2006 12:02:49 +0400
From: Edward Chernenko <edwardspec at gmail.com>
To: Daniel Kinzler <daniel at brightbyte.de>
CC: toolserver-l at wikipedia.org
References: <44AB8C72.90102 at brightbyte.de>
2006/7/5, Daniel Kinzler <daniel at brightbyte.de>:
> Hi
>
> When monitoring activity on the toolserver, I often notice your script
> wikicnt_daemon.pl - it seems to be started every few hours, run for
> quite a while, and there are often many instances running at once (33 at
> the moment). I suspect (but I'm not sure) that it may be one of the
> reasons the toolserver often falls behind with replicating from the
> master db. Critical resources are RAM and Disk-I/O, and thus SQL
> queries, of course.
> Please tell me what that script does, and why there are so many
> instances at once. Please send a copy of your response to
> <toolserver-l at Wikipedia.org>. Thanks!
> Regards,
> Daniel aka Duesentrieb
Hi Daniel,
This script is an article counter installed by the admins of the Russian
Wikipedia. It has to make 5-100 inserts into the database per second. I'm
going to move it from MySQL to a GDBM database (which should reduce the
load), but that is not done yet.
Running it as a daemon is currently quite a good optimization, because of
the persistent connections to MySQL and the use of prepared statements.
Unfortunately, the installed Perl version has no thread support, and a
single process can't dispatch all 5-100 requests per second because it
spends too much time waiting for the MySQL server's reply. So the script
simply forks five times (each process forks again on every round, so there
are 2^5 = 32 processes, not 33) after creating the listening socket and
before connecting to the database.
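Roughly, the structure is like this (only a simplified sketch, not the
real wikicnt_daemon.pl code; the port, credentials and table/column names
here are just placeholders):

  use strict;
  use warnings;
  use IO::Socket::INET;
  use DBI;

  # Create the listening socket first, so all forked children share it.
  my $listener = IO::Socket::INET->new(
      LocalPort => 8123,      # placeholder port
      Proto     => 'tcp',
      Listen    => 64,
      ReuseAddr => 1,
  ) or die "listen: $!";

  # Fork five times: both parent and child keep looping, so the number
  # of processes doubles each round, giving 2^5 = 32.
  for (1 .. 5) {
      defined(fork()) or die "fork: $!";
  }

  # Only after forking does each process open its own persistent MySQL
  # connection and prepare the statement it reuses for every hit.
  my $dbh = DBI->connect('DBI:mysql:database=hitcounter;host=localhost',
                         'user', 'secret', { RaiseError => 1 });
  my $sth = $dbh->prepare(
      'UPDATE hits SET counter = counter + 1 WHERE page_id = ?');

  while (my $client = $listener->accept()) {
      chomp(my $page_id = <$client>);
      $sth->execute($page_id);
      close($client);
  }

Each process connects and prepares once, then only executes the prepared
statement per hit, which is where the savings over per-request CGI come
from.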
Another optimization was caching the results of 'page_title -> page_id'
lookups in memory (a cron task restarted the daemon every hour to clear
the cache). Here I made a mistake: a full cache (with information about
all pages) takes 14 MB, but after your report I realized that across 32
processes this can add up to 14*32 = 448 MB. I have now moved the cache
into a GDBM database which is updated only once per day. Please check;
RAM usage should now be no more than 1-2 MB.
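The cache now works roughly like this (again only a sketch; the file name
is a placeholder):

  use strict;
  use warnings;
  use GDBM_File;

  # Each of the 32 processes ties the same read-only GDBM file instead
  # of keeping its own 14 MB in-memory copy of the title cache.
  tie my %page_id_of, 'GDBM_File', '/home/edwardspec/page_ids.gdbm',
      &GDBM_READER, 0644
      or die "tie: $!";

  my $id = $page_id_of{'Main_Page'};   # 'page_title -> page_id' lookup
  untie %page_id_of;

Because the file is opened read-only and shared between all processes,
each process itself keeps almost nothing in RAM.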
I'm now working on further optimizations, but the counter has to keep
running while this is done (it collects no information about the time of
requests, only the hits accumulated since it was first launched).
P.S. The required libraries (gdbm, sqlite) and Perl modules (GDBM_File
and DBD::SQLite) are not installed.
--
Edward Chernenko <edwardspec at gmail.com>
--
Homepage: http://brightbyte.de