Edward Chernenko replied to my question about the wikicnt_daemon.pl script. He CC'd this list, but apparently that did not go through. Maybe he's not registered here? He should be...
Anyway, below is his response, fyi:
-------- Original Message --------
Subject: Re: wikicnt_daemon.pl
Date: Fri, 7 Jul 2006 12:02:49 +0400
From: Edward Chernenko edwardspec@gmail.com
To: Daniel Kinzler daniel@brightbyte.de
CC: toolserver-l@wikipedia.org
References: 44AB8C72.90102@brightbyte.de
2006/7/5, Daniel Kinzler daniel@brightbyte.de:
Hi
When monitoring activity on the toolserver, I often notice your script wikicnt_daemon.pl - it seems to be started every few hours and to run for quite a while, and there are often many instances running at once (33 at the moment). I suspect (but I'm not sure) that it may be one of the reasons the toolserver often falls behind with replicating from the master DB. The critical resources are RAM and disk I/O, and thus SQL queries, of course.
Please tell me what that script does, and why there are so many instances at once. Please send a copy of your response to toolserver-l@Wikipedia.org. Thanks!
Regards, Daniel aka Duesentrieb
Hi Daniel,
This script is an article counter installed by the admins of the Russian Wikipedia. It has to make 5-100 inserts into the database per second. I'm now going to move this from MySQL to a GDBM database (this should reduce the load), but that is not done yet.
Currently, running this as a daemon is quite a good optimization because of the persistent connections to MySQL and the use of prepared statements.
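For illustration only, a minimal sketch of that pattern - connect once, prepare once, execute per hit - assuming Perl DBI and a hypothetical hit_counter table (this is not the actual wikicnt_daemon.pl code):

  use DBI;

  # Connect once at daemon start-up and reuse the handle for every request.
  my $dbh = DBI->connect('DBI:mysql:database=wikicnt;host=localhost',
                         'user', 'password', { RaiseError => 1 });

  # Prepare the statement once; only the bound value changes per hit.
  my $sth = $dbh->prepare(
      'UPDATE hit_counter SET hits = hits + 1 WHERE page_title = ?');

  # In the real daemon the titles come from the listening socket;
  # STDIN stands in for that here.
  while (my $title = <STDIN>) {
      chomp $title;
      $sth->execute($title);
  }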
Unfortunately, there is no thread support in the installed Perl version, and one process can't dispatch all 5-100 requests per second because it spends too much time waiting for the MySQL server's reply. So the script simply forks five times (hence 32 processes, not 33) after creating the listening socket and before connecting to the database.
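The fork-after-listen pattern he describes looks roughly like this (a sketch only; the port number is an assumption, and each resulting process would then open its own MySQL connection as in the previous sketch):

  use IO::Socket::INET;

  # One listening socket, created before forking, is inherited by all children.
  my $listen = IO::Socket::INET->new(
      LocalPort => 8888,   # assumed port
      Listen    => 64,
      Reuse     => 1,
  ) or die "listen: $!";

  # Fork 5 times; each pass doubles the number of processes, so 2**5 = 32.
  for (1 .. 5) {
      my $pid = fork();
      die "fork failed: $!" unless defined $pid;
      # Both parent and child fall through and fork again on the next pass.
  }

  # Each of the 32 processes now connects to MySQL and accept()s on the
  # shared socket in its own loop.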
Another optimization was caching the results of 'page_title -> page_id' lookups in memory (a cron task restarted the daemon each hour to clear the cache). Here I made a mistake: the full cache (with info about all pages) takes 14 MB, but after your report I realized that across all processes it can take 14*32 = 448 MB. I have now moved this into a GDBM database which is updated only once per day. Please check; RAM usage should now be no more than 1-2 MB.
I'm now working on further optimizations, but the counter should stay on in the meantime (it collects no information about the time of requests, only hits from the moment it was first launched).
P.S. The required libraries (gdbm, sqlite) and Perl modules (GDBM_File and DBD::SQLite) are not installed.
On 07/07/06, Daniel Kinzler daniel@brightbyte.de wrote:
Edward Chernenko replied to my question about the wikicnt_daemon.pl script. He CC'd this list, but apparently that did not go through. Maybe he's not registered here? He should be...
He is, but according to the subscriber list, he's on digest mode.
Anyway, below is his response, fyi:
-------- Original Message --------
Subject: Re: wikicnt_daemon.pl
Date: Fri, 7 Jul 2006 12:02:49 +0400
From: Edward Chernenko edwardspec@gmail.com
To: Daniel Kinzler daniel@brightbyte.de
CC: toolserver-l@wikipedia.org
References: 44AB8C72.90102@brightbyte.de
2006/7/5, Daniel Kinzler daniel@brightbyte.de:
Hi
When monitoring activity on the toolserver, I often notice your script wikicnt_daemon.pl - it seems to be started every few hours and to run for quite a while, and there are often many instances running at once (33 at the moment). I suspect (but I'm not sure) that it may be one of the reasons the toolserver often falls behind with replicating from the master DB. The critical resources are RAM and disk I/O, and thus SQL queries, of course.
Please tell me what that script does, and why there are so many instances at once. Please send a copy of your response to toolserver-l@Wikipedia.org. Thanks!
Regards, Daniel aka Duesentrieb
Hi Daniel,
This script is an article counter installed by the admins of the Russian Wikipedia. It has to make 5-100 inserts into the database per second. I'm now going to move this from MySQL to a GDBM database (this should reduce the load), but that is not done yet.
I'm not too chuffed at the precedent this is going to set, to be honest. Each Russian Wikipedia page view results in a hit to Zedler... it's still going to be using RAM and disk, etc... if Wikimedia want this sort of stuff, they should bloody well set it up themselves.
Currently, running this as a daemon is quite a good optimization because of the persistent connections to MySQL and the use of prepared statements.
Unfortunately, there is no thread support in the installed Perl version, and one process can't dispatch all 5-100 requests per second because it spends too much time waiting for the MySQL server's reply. So the script simply forks five times (hence 32 processes, not 33) after creating the listening socket and before connecting to the database.
Yes, the listening socket. A foreign port that you didn't have authorisation to use, so far as I know - it would almost certainly be documented internally if you had. And didn't Duesentrieb notice something weird about the setup?
Another optimization was caching the results of 'page_title -> page_id' lookups in memory (a cron task restarted the daemon each hour to clear the cache). Here I made a mistake: the full cache (with info about all pages) takes 14 MB, but after your report I realized that across all processes it can take 14*32 = 448 MB. I have now moved this into a GDBM database which is updated only once per day. Please check; RAM usage should now be no more than 1-2 MB.
*You* need to take responsibility for checking that *your* tools aren't causing *our* server to die. You've got to be reasonable and make sure you glance at it all periodically.
Rob Church
*You* need to take responsibility for checking that *your* tools aren't causing *our* server to die. You've got to be reasonable and make sure you glance at it all periodically.
Rob Church
Rob, I agree with most of what you said, except for the last statement - we are all developers donating our time to the common cause. We all make mistakes, and we help each other to fix them. Edward accepts responsibility by figuring out what's wrong and trying to fix it. Saying *your* tool on *our* server is, in my opinion, improper.
--Yuri
On 07/07/06, Yuri Astrakhan yuriastrakhan@gmail.com wrote:
Rob, I agree with most of what you said, except for the last statement - we are all developers donating our time to the common cause. We all make mistakes, and we help each other to fix them. Edward accepts responsibility by figuring out what's wrong and trying to fix it. Saying *your* tool on *our* server is, in my opinion, improper.
Well, in MY opinion, the "common cause" sounds disgusting, Communistic and foul.
Good job we can't be persecuted for our opinions where I live.
Rob Church
Prosecuted for opinions - no. Scolded for childish behavior - yes.
By saying "*our* server", you divided the community into Edward and everyone else, and assigned yourself to be the official representative of "everyone else". Please first obtain such a mandate, and only then make such statements. For now, I think you can only make statements as you, Rob, not as the group. The royal *we* is a bit out of fashion.
On 7/7/06, Rob Church robchur@gmail.com wrote:
On 07/07/06, Yuri Astrakhan yuriastrakhan@gmail.com wrote:
Rob, I agree with most of what you said, except for the last statement - we are all developers donating our time to the common cause. We all make mistakes, and we help each other to fix them. Edward accepts responsibility by figuring out what's wrong and trying to fix it. Saying *your* tool on *our* server is, in my opinion, improper.
Well, in MY opinion, the "common cause" sounds disgusting, Communistic and foul.
Good job we can't be persecuted for our opinions where I live.
Rob Church
On 07/07/06, Yuri Astrakhan yuriastrakhan@gmail.com wrote:
By saying "*our* server", you divided the community into Edward and everyone else, and assigned yourself to be the official representative of "everyone else". Please first obtain such a mandate, and only then make such statements. For now, I think you can only make statements as you, Rob, not as the group. The royal *we* is a bit out of fashion.
Shall we drop the pedantic pissing about and get back to the real issue, then?
Rob Church
Shall we drop the pedantic pissing about and get back to the real issue, then?
Yes, please.
I talked to Leon about ways to make hit counters feasible for all projects. The core points are:
* Just like Edward did, use JS code to trigger an HTTP request on page views. But this should be throttled to a probability of 1% - or, for large projects, 0.1%. This should still give us usable stats for the most popular pages.
* Just like Edward, use a persistent server, not CGI/PHP. To avoid exposing home-brewed hacks to the wild web, we should stick to something tried and true. I suggested implementing it as a Java servlet. It should be fairly straightforward, and we have Tomcat running anyway.
* To get around latency issues with the database, don't spawn more processes (that causes more load on the already troubled DB); instead, cache updates in RAM for a minute or so, then flush them into the DB in a single insert (sketched below).
* Edward used a lot of RAM for a name -> id mapping. This should be avoided - the name is unique, we don't need the page ID. If we want the ID, it should be determined on the Wikipedia server and supplied with the request - I talked to Tim Starling about making this and other useful things available as JS variables.
Perhaps Edward and Leon can work on this together. In any case, I would suggest throttling updates from ruwiki to 1% of page hits, *if* the page counter is to be enabled again. Something like this should do:
if (Math.round(Math.random() * 100) == 1) { ... }  // fires for roughly 1% of page views
Regards, -- Daniel
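The batching point from the list above (cache in RAM, flush once a minute) could look roughly like this in Perl - only a sketch, assuming a hypothetical hit_counter table with a unique key on page_title; the actual suggestion above is a Java servlet:

  use DBI;

  my $dbh = DBI->connect('DBI:mysql:database=wikicnt;host=localhost',
                         'user', 'password', { RaiseError => 1 });

  my %pending;              # page title -> hits accumulated since last flush
  my $last_flush = time();

  # Called once per counted page view.
  sub count_hit {
      my ($title) = @_;
      $pending{$title}++;
      flush() if time() - $last_flush >= 60;   # flush about once a minute
  }

  # Write all buffered counts to the database in a single statement.
  sub flush {
      return unless %pending;
      my @titles = keys %pending;
      my $sql = 'INSERT INTO hit_counter (page_title, hits) VALUES '
              . join(',', ('(?,?)') x @titles)
              . ' ON DUPLICATE KEY UPDATE hits = hits + VALUES(hits)';
      $dbh->do($sql, undef, map { ($_, $pending{$_}) } @titles);
      %pending    = ();
      $last_flush = time();
  }

The single INSERT ... ON DUPLICATE KEY UPDATE turns a minute's worth of hits into one write per flushed page instead of one query per page view.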
Daniel Kinzler schrieb:
Shall we drop the pedantic pissing about and get back to the real issue, then?
Yes, please.
I talked to Leon about ways to make hit counters feasible, for all projects. The core points are: [...]
Yeah. I wanted to create these stats for dewiki, so just like Duesentrieb said, let the JS just call the page with a probability of something maybe less than 1/1000. But NullC gave me a great idea now: I'll just let the JS call an empty text file and collect the stats from the Apache logs. There's no more efficient way.
Leon
On 7/7/06, Leon Weber leon.weber@leonweber.de wrote:
Yeah. I wanted to create these stats for dewiki, so just like Duesentrieb said, let the JS just call the page with a probability of something maybe less than 1/1000. But NullC gave me a great idea now: I'll just let the JS call an empty text file and collect the stats from the Apache logs. There's no more efficient way.
To clarify, IFF you disabled logging and you wrote an Apache module to maintain a counter in shared memory using an efficient data structure (perhaps a Judy array on page titles) and worked out the locking issues... it would be faster.
However, Apache logging is async and append-only. It's the simplest form of writing that could happen, and if we moved the logs into tmpfs it would likely be darn close to optimal.
Although the toolserver is disk-bound, what is killing us is random seeks (see iostat: we are constantly pegged at 350-400 TPS but moving less than 6 MB/sec)... so I would not expect many problems from an async, append-only writer, because its activity will be mostly sequential.
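As a rough illustration of the log-based approach (assuming the JS pings a hypothetical /counter.txt URL with the page title as a query parameter - neither detail is settled in this thread), the per-page counts could later be pulled out of the access log with a small Perl script:

  # Tally page hits from an Apache access log, assuming request lines like
  # "GET /counter.txt?title=Page_Name HTTP/1.1".
  my %hits;
  while (<>) {
      next unless m{"GET /counter\.txt\?title=([^ "]+)};
      $hits{$1}++;
  }
  printf "%8d  %s\n", $hits{$_}, $_
      for sort { $hits{$b} <=> $hits{$a} } keys %hits;

Run as, for example: perl count_hits.pl access.log > top_pages.txt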
Leon Weber schrieb:
Daniel Kinzler schrieb:
Shall we drop the pedantic pissing about and get back to the real issue, then?
Yes, please.
I talked to Leon about ways to make hit counters feasible, for all projects. The core points are: [...]
Yeah. I wanted to create these stats for dewiki, so just like Duesentrieb said, let the JS just call the page with a probability of something maybe less than 1/1000. But NullC gave me a great idea now: I'll just let the JS call an empty text file and collect the stats from the Apache logs. There's no more efficient way.
Problem is: how do you get the article name from the text file? Anyway, it would be a nice idea to put webalizer or another stats tool on the toolserver.
Greets, Marco
By the way, Leon had this topic on CC with his email address; I took him off so he doesn't get duplicate mails.
2006/7/7, Daniel Kinzler daniel@brightbyte.de:
I talked to Leon about ways to make hit counters feasible, for all projects. The core points are:
- Just like Edward did, use JS code to trigger an HTTP request on page views. But this should be throttled to a probability of 1% - or, for large projects, 0.1%. This should still give us usable stats for the most popular pages.
The ruwiki TOP100 script shows about 300 hits for the last (#100) place. It's better to handle at least 5-10% of all requests.
There's another optimization on the client side: my counter filtered out any requests to history, diffs, pages not in the article namespace, etc. This should be added to the JS script (sorry, I can't do so right now because I have no sysop rights).
- Just like Edward, use a persistent server, not CGI/PHP. To avoid exposing home-brewed hacks to the wild web, we should stick to something tried and true. I suggested implementing it as a Java servlet. It should be fairly straightforward, and we have Tomcat running anyway.
Please see the source: http://tools.wikimedia.de/~edwardspec/src/wikicnt_daemon.pl This is written in Perl. Also, anything strange in the HTTP connection results in it being closed without an answer. The only potential security problem here is reading the request line with my $req = <$c>; (there is no check for long lines - this is not fatal for Perl, but it might take some memory).
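One way to bound that read - a hedged sketch against the same client socket $c, not a patch to the actual script - is to read a limited number of bytes instead of an unbounded line:

  # Read at most 4096 bytes instead of an unbounded line from the client $c.
  my $req = '';
  while (length($req) < 4096) {
      my $n = sysread($c, my $buf, 4096 - length($req));
      last unless $n;                    # EOF or error
      $req .= $buf;
      last if index($req, "\n") >= 0;    # the request line is complete
  }
  ($req) = split /\r?\n/, $req;          # keep only the first line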
- To get around latency issues with the database, don't spawn more processes (that causes more load on the already troubled DB); instead, cache updates in RAM for a minute or so, then flush them into the DB in a single insert.
There's another problem: we need to save disk space too (it seems the default MediaWiki counter was disabled because it consumes too much space - 4*12000*60*60*24 = 4147200000 bytes, i.e. about 3955 MB each day).
I used UPDATE statements instead. Yes, this is worse (for example, INSERT can be optimized by writing into a text file and applying it with LOAD DATA LOCAL INFILE), but the database can't become larger than 6 MB (for ruwiki, with 900000 articles).
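To make the contrast concrete (a fragment only - $dbh and $title are assumed from the surrounding daemon code, and hit_counter/hit_log are hypothetical table names):

  # One small row per article: the table size is bounded by the number of pages.
  $dbh->do('UPDATE hit_counter SET hits = hits + 1 WHERE page_title = ?',
           undef, $title);

  # The per-hit INSERT variant (even when batched through a text file and
  # LOAD DATA LOCAL INFILE) adds a row for every hit, so it grows without bound:
  # $dbh->do('INSERT INTO hit_log (page_title, hit_time) VALUES (?, NOW())',
  #          undef, $title);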
- Edward used a lot of RAM for a name -> id mapping. This should be avoided - the name is unique, we don't need the page ID. If we want the ID, it should be determined on the Wikipedia server and supplied with the request - I talked to Tim Starling about making this and other useful things available as JS variables.
It's much more efficient to store IDs (they are smaller and always fixed-size). But actually this was requested by ruwiki users later, in order to preserve the counter value after _renaming_ an article. With titles, it could be lost.
Now this is not a problem: the small copy of the database (title as key and id as value) was moved into a GDBM file, and the in-memory cache is now disabled. The database copy is updated each day at 00:00 (this takes 5-7 seconds) and takes 14 MB of disk space.
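A sketch of what that nightly rebuild might look like, assuming the GDBM_File module is available (connection details and file name are hypothetical; this is not Edward's actual code):

  use DBI;
  use GDBM_File;

  # Hypothetical replica connection - not Edward's actual settings.
  my $dbh = DBI->connect('DBI:mysql:database=ruwiki_p;host=localhost',
                         'user', 'password', { RaiseError => 1 });

  # Rebuild the on-disk title -> id map once a day.
  tie my %title2id, 'GDBM_File', 'title2id.gdbm', &GDBM_WRCREAT, 0644
      or die "tie: $!";

  my $sth = $dbh->prepare(
      'SELECT page_title, page_id FROM page WHERE page_namespace = 0');
  $sth->execute();
  while (my ($title, $id) = $sth->fetchrow_array()) {
      $title2id{$title} = $id;
  }
  untie %title2id;

  # At run time the daemon ties the file read-only (GDBM_READER) and looks
  # up IDs from disk, so almost nothing stays in RAM.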