Hello,
we have now some kind of 'what pages are visited' statistics. It isn't very trivial to separate exact pageviews, but this regular expression should do the rough job :)
urlre = re.compile('^http://([^\.]+)\.wikipedia.org/wiki/([^?]+)')
It is applied to our squid access-log stream and redirected to a profiling agent (webstatscollector); the hourly snapshots are then written out in a very trivial format. This can be used both for noticing strange activities and for spotting trends (specific events show up really nicely - be it a movie premiere, a national holiday or some scandal :). Last March, when I was experimenting with it, it was impossible not to notice that "300" had hit theatres, St. Patrick's Day revealed Ireland, and there was some crazy DDoS against us.
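To give a rough idea of how it is used (just a sketch, not the actual webstatscollector code - how the URL field is pulled out of the log line is an assumption here):

import re
from collections import Counter

urlre = re.compile('^http://([^\.]+)\.wikipedia.org/wiki/([^?]+)')
counts = Counter()

def handle_log_line(line):
    # assume the requested URL is one of the whitespace-separated fields of the squid log line
    for field in line.split():
        m = urlre.match(field)
        if m:
            lang, title = m.group(1), m.group(2)
            counts[(lang, title)] += 1  # aggregate hits per (project, page title)
            return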
Anyway, log files for now are at: http://dammit.lt/wikistats/ - I haven't figured out a retention policy yet, but as there are a few gigs available, at least a few weeks should stay up. A normal snapshot contains ~3.5M page titles and is over 100MB extracted. Entries inside are grouped by project, in semi-alphabetic order.
I'm experimenting with visualization software too, so if you have any ideas and are too lazy to implement - share them anyway :)
Cheers,
Domas Mituzas midom.lists@gmail.com wrote: Hello,
we have now some kind of 'what pages are visited' statistics.
Really cool, Domas, good work.
I'm experimenting with visualization software too, so if you have any ideas and are too lazy to implement - share them anyway :)
It heavily depends on what kind of parameters and analysis you want to visualize. As an all-in-one solution for both statistical analysis and visualization, there is nothing out there better than GNU R.
Regards,
Felipe.
On Dec 10, 2007 1:03 PM, Domas Mituzas midom.lists@gmail.com wrote:
Anyway, log files for now are at: http://dammit.lt/wikistats/
[...]
I'm experimenting with visualization software too, so if you have any ideas and are too lazy to implement - share them anyway :)
Speaking of lazy (camouflaged by "save electrons"..): would you be willing to provide per-project files, e.g. pagecounts-dewp-20071210-140000.gz ?
Anyhow: Thank you so much for providing these...
Mathias
Hi!
Speaking of lazy (camouflaged by "save electrons"..): would you be willing to provide per-project files, e.g. pagecounts-dewp-20071210-140000.gz ?
zgrep ^de pagecounts ;-)
That will be somewhat easier than having hundreds of files per snapshot on our side ;-)
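And if you really want per-project files locally, something like this would do the split (a rough sketch - the filename is hypothetical and I'm assuming each line starts with the project code, adjust as needed):

import gzip
from collections import defaultdict

# split a downloaded snapshot into per-project files locally
by_project = defaultdict(list)
with gzip.open('pagecounts-20071210-140000.gz', 'rt', errors='replace') as f:
    for line in f:
        by_project[line.split(' ', 1)[0]].append(line)

for project, lines in by_project.items():
    with open('pagecounts-%s-20071210-140000' % project, 'w') as out:
        out.writelines(lines)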
On 10/12/2007, Domas Mituzas midom.lists@gmail.com wrote:
Hello,
we have now some kind of 'what pages are visited' statistics. It isn't very trivial to separate exact pageviews, but this regular expression should do the rough job :)
[...]
Hi,
Thanks for providing this data.
Just to confirm, each file has all the pageviews on all wikimedia sites for that hour in that day, is that right?
Each line has 4 fields: projectcode, pagename, then what does the pair of numbers mean?
thanks, Brianna
Brianna,
Just to confirm, each file has all the pageviews on all wikimedia sites for that hour in that day, is that right?
Of all Wikipedias so far; the filter needs tweaking to add other projects :)
Each line has 4 fields: projectcode, pagename, then what does the pair of numbers mean?
The first is the number of pageviews, the second should be bytes, but here again I was too lazy to adjust the filters to feed that information - will happen soonish :)
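So for now each line can be read as something like this (a sketch only; fields are assumed to be space-separated, and the byte count is just a placeholder until the filters are adjusted):

def parse_line(line):
    # projectcode pagename pageviews bytes
    projectcode, pagename, views, byte_count = line.rstrip('\n').split(' ')
    return projectcode, pagename, int(views), int(byte_count)

# hypothetical example: parse_line('de Hauptseite 12345 0') -> ('de', 'Hauptseite', 12345, 0)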
BR,
On 12/12/2007, Domas Mituzas midom.lists@gmail.com wrote:
Brianna,
Just to confirm, each file has all the pageviews on all wikimedia sites for that hour in that day, is that right?
Of all Wikipedias so far; the filter needs tweaking to add other projects :)
Hm, that may explain the seemingly small data for commons that I tried to extract. About 5000 entries out of all those zipped files. :)
So if it is only Wikipedia, why were those lines there at all?
cheers Brianna
Brianna Laugher wrote:
On 12/12/2007, Domas Mituzas wrote:
Brianna,
Just to confirm, each file has all the pageviews on all wikimedia sites for that hour in that day, is that right?
Of all Wikipedias so far; the filter needs tweaking to add other projects :)
Hm, that may explain the seemingly small data for commons that I tried to extract. About 5000 entries out of all those zipped files. :)
So if it is only Wikipedia, why were those lines there at all?
cheers Brianna
Commons is in the 'wikipedia' project, just like meta. You can see it in the config and db names.