Hello,
we now have some kind of 'what pages are visited' statistics. It
isn't very trivial to separate exact pageviews, but this regular
expression should do the rough job :)
urlre =
re.compile(r'^http://([^.]+)\.wikipedia\.org/wiki/([^?]+)')
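For example, the two capture groups pull out the language subdomain and the page title (the sample URL below is purely illustrative):

```python
import re

# Same expression as above; groups are (subdomain, page title).
urlre = re.compile(r'^http://([^.]+)\.wikipedia\.org/wiki/([^?]+)')

# Hypothetical sample URL, just for illustration.
m = urlre.match('http://en.wikipedia.org/wiki/Main_Page?action=history')
if m:
    project, title = m.groups()
    print(project, title)  # en Main_Page
```

Note the `[^?]+` stops at the query string, so `?action=history` and friends don't pollute the title.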
It is applied to our squid access-log stream and the output is
redirected to a profiling agent (webstatscollector); the hourly
snapshots are then written in a very trivial format.
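A minimal sketch of what that hourly aggregation could look like (the input here is a bare stream of URLs - an assumption for illustration; real squid log lines carry more fields):

```python
import re
from collections import Counter

urlre = re.compile(r'^http://([^.]+)\.wikipedia\.org/wiki/([^?]+)')

def count_pageviews(url_lines):
    """Tally pageviews per (project, title) from a stream of request URLs."""
    counts = Counter()
    for url in url_lines:
        m = urlre.match(url.strip())
        if m:
            counts[m.groups()] += 1
    return counts

# Hypothetical input, for illustration only.
sample = [
    'http://en.wikipedia.org/wiki/300_(film)',
    'http://en.wikipedia.org/wiki/300_(film)',
    'http://lt.wikipedia.org/wiki/Vilnius',
]
print(count_pageviews(sample))
```

Non-matching lines (images, API calls, other domains) simply fall through, which is roughly how the rough pageview separation works.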
This can be used both for noticing strange activity and for
spotting trends (specific events show up really nicely - be it a
movie premiere, a national holiday or some scandal :). Last March,
when I was experimenting with it, it was impossible not to notice
that "300" had hit theatres, St. Patrick's Day revealed Ireland, and
there was some crazy DDoS against us.
Anyway, log files for now are at:
http://dammit.lt/wikistats/
- I haven't figured out a retention policy yet, but as there are a
few gigs available, at least a few weeks should stay up.
A normal snapshot contains ~3.5M page titles and is over 100MB
extracted. Entries inside are grouped by project, in semi-alphabetic
order.
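If you want to poke at a snapshot yourself, something like this sketch should do. It assumes whitespace-separated lines beginning with 'project title count' - that layout is my illustration here, so check an actual file for the exact columns:

```python
def read_snapshot(path, project=None):
    """Yield (project, title, count) tuples from a snapshot file.

    Assumes whitespace-separated lines starting with
    'project title count'; verify the column layout against a real
    snapshot before relying on this.
    """
    with open(path, encoding='utf-8', errors='replace') as f:
        for line in f:
            parts = line.split()
            if len(parts) < 3:
                continue  # skip blank or malformed lines
            proj, title, count = parts[0], parts[1], int(parts[2])
            if project is None or proj == project:
                yield proj, title, count
```

Filtering by project while reading keeps memory flat even though the full file holds millions of titles.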
I'm experimenting with visualization software too, so if you have any
ideas and are too lazy to implement them - share them anyway :)
Cheers,
--
Domas Mituzas --
http://dammit.lt/ -- [[user:midom]]