[Foundation-l] Wikipedia tracks user behaviour via third party companies

Thu Jun 4 17:19:24 UTC 2009

[repost with proper subscribed mail address]

Alex wrote:

> The plain pageview stats are already available.
> Erik Zachte has been doing some work on other stats.
> <http://stats.wikimedia.org/EN/VisitorsSampledLogRequests.htm>

> If I were to compile a wishlist of stats things:

> 1. stats.grok.se data for non-Wikipedia projects
> 2. A better interface for stats.wikimedia.org - There's a lot of data
> there, but it can be hard to find and it's not very publicized. The
> only reason I knew about the link above is because someone pointed it
> out to me once and I bookmarked it.
> 3. Pageview stats at <http://dammit.lt/wikistats/> in files based on 
> projects. It would be a lot easier for people at the West Flemish 
> Wikipedia to analyze statistics themselves if they didn't have to 
> download tons of data they don't need.

Your enhancement requests:

1 IIRC this is already an (albeit undocumented) feature.
One can manually alter the URL to find e.g. Wiktionary stats.
But I forgot precisely how, and I see nothing on User:Henrik's talk page.

2 Seconded wholeheartedly. In fact I started to reshape the main page (just
eight links) this week :) I just uploaded it a bit earlier than planned:
http://stats.wikimedia.org/

3 That could be a useful extension of the preservation script described
below.
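
A minimal sketch of what such a per-project split could look like. It
assumes each pagecount line has four space-separated fields (project code,
page title, request count, bytes); the function name and the sample lines
are hypothetical and only illustrate the idea, they are not taken from the
actual script.

```python
from collections import defaultdict


def split_by_project(lines):
    """Group pagecount lines by project code (assumed to be the first field)."""
    per_project = defaultdict(list)
    for line in lines:
        parts = line.split(" ")
        if len(parts) != 4:
            continue  # skip lines that do not match the assumed format
        per_project[parts[0]].append(line)
    return dict(per_project)


# Hypothetical sample lines, not real traffic data.
sample = [
    "en Main_Page 120 4096",
    "vls Voorblad 7 2048",
    "en Influenza 33 8192",
    "garbled line",
]
by_project = split_by_project(sample)
```

Each list in by_project could then be written to its own file, so that a
small project only has to download its own lines.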

--------------------------------
General response 

I would say quite a lot has happened since the beginning of 2008. A recap:

As has already been said, Domas' (and Tim's) work was a major step forward.

http://dammit.lt/wikistats/

Two very useful aggregators of this data, on a page-by-page basis, are:

http://stats.grok.se/
http://wikistics.falsikon.de/

Based on the same data, at a higher aggregation level, there are visitor
counts for all projects in an easily digestible fashion:

http://stats.wikimedia.org/EN/TablesPageViewsMonthly.htm

Also, for the past two months we have known much more about Wikimedia
traffic, based on 8 reports with all kinds of cross sections:

http://infodisiac.com/blog/2009/04/wikimedia-traffic-analyzed/

With regard to the dammit.lt raw data, I helped to preserve these for
posterity in a more compact and slightly filtered state, so that we can
query them for much longer. (The dammit.lt server has space for one or two
months.) Actually Mathias Schindler started this important rescue effort.
Each day all files are downloaded and processed, reduced from 40 GB per
month to 3 GB (May 2009). I also made a script to query these files, which
is much more efficient than processing the original hourly files. But
runtime is still considerable, so querying these files without restrictions
through a public interface is not advisable. The toolserver could get a
copy of the files, of course.

http://infodisiac.com/blog/wp-content/uploads/2009/05/influenza1.png
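
To illustrate the kind of reduction involved, here is a rough sketch that
folds hourly files into one daily total per page and drops rarely viewed
pages. It is not the actual preservation script: the four-field line format
(project, title, count, bytes), the function name and the min_views
threshold are all assumptions made for this example.

```python
from collections import Counter


def daily_totals(hourly_files, min_views=5):
    """Sum hourly request counts per (project, title); drop low-traffic pages."""
    totals = Counter()
    for lines in hourly_files:
        for line in lines:
            parts = line.split(" ")
            if len(parts) != 4 or not parts[2].isdigit():
                continue  # skip lines that do not match the assumed format
            project, title, count, _size = parts
            totals[(project, title)] += int(count)
    # Assumed filter: keep only pages with at least min_views requests that day.
    return {key: n for key, n in totals.items() if n >= min_views}


# Hypothetical hourly data, not real traffic figures.
hour_00 = ["en Influenza 3 1000", "en Rare_page 1 500"]
hour_01 = ["en Influenza 4 1200"]
daily = daily_totals([hour_00, hour_01], min_views=5)
```

Storing one line per page per day instead of one per hour, and skipping the
long tail of pages with only a handful of requests, is one plausible way to
get a reduction of this order.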

Is this enough? Of course not, there is so much more to learn.

Regarding geo data: for many months a patch for Domas' (and Tim's) code,
by Antonio José Reinoso Peinado, has been lying around. It would add
country-level geolocation data from MaxMind's public database (IP -> geo
lookup). Although I promised to look at it, I haven't found the time yet.
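
For readers unfamiliar with the idea, an IP -> country lookup boils down to
finding the address range that contains a given IP. The sketch below is not
the patch mentioned above and does not use MaxMind's data format or API; it
uses a tiny hypothetical range table (reserved documentation addresses with
made-up country codes) and Python's standard library only.

```python
import bisect
import ipaddress


def _to_int(ip):
    """Convert a dotted IP string to an integer for range comparisons."""
    return int(ipaddress.ip_address(ip))


# Hypothetical (start, end, country) ranges, sorted by start address.
RANGES = sorted([
    (_to_int("192.0.2.0"), _to_int("192.0.2.255"), "NL"),
    (_to_int("198.51.100.0"), _to_int("198.51.100.255"), "LT"),
])
STARTS = [start for start, _end, _country in RANGES]


def country_for(ip):
    """Return the country code for ip, or None if no range contains it."""
    n = _to_int(ip)
    i = bisect.bisect_right(STARTS, n) - 1  # last range starting at or before n
    if i >= 0 and RANGES[i][0] <= n <= RANGES[i][1]:
        return RANGES[i][2]
    return None
```

With a lookup like this, a log line can be counted per country and the IP
address itself need not be kept afterwards.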

Regarding web bugs: comScore also proposed such a scheme to us.
Apart from the question of how much it would bring us that we don't or
can't figure out ourselves, an overriding concern is privacy.

Erik Zachte
Data Analyst
Wikimedia Foundation, Inc.
E-Mail: ezachte at wikimedia.org