Hello,
we have now some kind of 'what pages are visited' statistics. It isn't very trivial to separate exact pageviews, but this regular expression should do the rough job :)
urlre = re.compile('^http://([^\.]+)\.wikipedia.org/wiki/([^?]+)')
It is applied to our squid access-log stream and redirected to a profiling agent (webstatscollector); the hourly snapshots are then written out in a very trivial format. This can be used both for noticing strange activities and for spotting trends (specific events show up really nicely - be it a movie premiere, a national holiday or some scandal :). Last March, when I was experimenting with it, it was impossible not to notice that "300" had hit theatres, St. Patrick's Day revealed Ireland, and there was some crazy DDoS against us.
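To give a rough idea of how it is used (just a sketch, not the actual webstatscollector code - how the URL field is pulled out of the log line is an assumption here):

import re
from collections import Counter

urlre = re.compile('^http://([^\.]+)\.wikipedia.org/wiki/([^?]+)')
counts = Counter()

def handle_log_line(line):
    # assume the requested URL is one of the whitespace-separated fields of the squid log line
    for field in line.split():
        m = urlre.match(field)
        if m:
            lang, title = m.group(1), m.group(2)
            counts[(lang, title)] += 1  # aggregate hits per (project, page title)
            return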
Anyway, log files for now are at: http://dammit.lt/wikistats/ - I haven't figured out a retention policy yet, but as there are a few gigs available, at least a few weeks should stay up. A normal snapshot contains ~3.5M page titles and is over 100MB extracted. Entries inside are grouped by project, in semi-alphabetic order.
I'm experimenting with visualization software too, so if you have any ideas and are too lazy to implement - share them anyway :)
Cheers,
Domas Mituzas midom.lists@gmail.com wrote: Hello,
we have now some kind of 'what pages are visited' statistics.
Really cool, Domas, good work.
I'm experimenting with visualization software too, so if you have any ideas and are too lazy to implement - share them anyway :)
It heavily depends on what kind of parameters and analysis you want to visualize. As an all-in-one solution for both statistical analysis and visualization, there is nothing out there better than GNU R.
Regards,
Felipe.
On Dec 10, 2007 1:03 PM, Domas Mituzas midom.lists@gmail.com wrote:
Anyway, log files for now are at: http://dammit.lt/wikistats/
[...]
I'm experimenting with visualization software too, so if you have any ideas and are too lazy to implement - share them anyway :)
Speaking of lazy (camouflaged by "save electrons"..): would you be willing to provide per-project files, e.g. pagecounts-dewp-20071210-140000.gz ?
Anyhow: Thank you so much for providing these...
Mathias
Hi!
Speaking of lazy (camouflaged by "save electrons"..): would you be willing to provide per-project files, e.g. pagecounts-dewp-20071210-140000.gz ?
zgrep ^de pagecounts ;-)
That will be somewhat easier than having hundreds of files per snapshot on our side ;-)
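And if you really want per-project files locally, something like this would do the split (a rough sketch - the filename is hypothetical and I'm assuming each line starts with the project code, adjust as needed):

import gzip
from collections import defaultdict

# split a downloaded snapshot into per-project files locally
by_project = defaultdict(list)
with gzip.open('pagecounts-20071210-140000.gz', 'rt', errors='replace') as f:
    for line in f:
        by_project[line.split(' ', 1)[0]].append(line)

for project, lines in by_project.items():
    with open('pagecounts-%s-20071210-140000' % project, 'w') as out:
        out.writelines(lines)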
On 10/12/2007, Domas Mituzas midom.lists@gmail.com wrote:
Hello,
we have now some kind of 'what pages are visited' statistics. It isn't very trivial to separate exact pageviews, but this regular expression should do the rough job :)
[...]
Hi,
Thanks for providing this data.
Just to confirm, each file has all the pageviews on all wikimedia sites for that hour in that day, is that right?
Each line has 4 fields: projectcode, pagename, then what does the pair of numbers mean?
thanks, Brianna
Brianna,
Just to confirm, each file has all the pageviews on all wikimedia sites for that hour in that day, is that right?
Of all Wikipedias so far; the filter needs tweaking to add other projects :)
Each line has 4 fields: projectcode, pagename, then what does the pair of numbers mean?
The first is the number of pageviews, the second should be bytes, but here again I was too lazy to adjust the filters to feed that information - will happen soonish :)
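So for now each line can be read as something like this (a sketch only; fields are assumed to be space-separated, and the byte count is just a placeholder until the filters are adjusted):

def parse_line(line):
    # projectcode pagename pageviews bytes
    projectcode, pagename, views, byte_count = line.rstrip('\n').split(' ')
    return projectcode, pagename, int(views), int(byte_count)

# hypothetical example: parse_line('de Hauptseite 12345 0') -> ('de', 'Hauptseite', 12345, 0)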
BR,
On 12/12/2007, Domas Mituzas midom.lists@gmail.com wrote:
Brianna,
Just to confirm, each file has all the pageviews on all wikimedia sites for that hour in that day, is that right?
Of all Wikipedias so far; the filter needs tweaking to add other projects :)
Hm, that may explain the seemingly small data for commons that I tried to extract. About 5000 entries out of all those zipped files. :)
So if it is only Wikipedia, why were those lines there at all?
cheers Brianna
Brianna Laugher wrote:
On 12/12/2007, Domas Mituzas wrote:
Brianna,
Just to confirm, each file has all the pageviews on all wikimedia sites for that hour in that day, is that right?
Of all Wikipedias so far; the filter needs tweaking to add other projects :)
Hm, that may explain the seemingly small data for commons that I tried to extract. About 5000 entries out of all those zipped files. :)
So if it is only Wikipedia, why were those lines there at all?
cheers Brianna
Commons is in the 'wikipedia' project, just like meta. You can see it in the config and db names.