For a while now, we've been releasing squid log data, stripped of personally identifying information such as IP addresses, to groups at two universities: Vrije Universiteit and the University of Minnesota. We now have a request pending from a third group, at Universidad Rey Juan Carlos in Spain. They are asking if they can have the full data stream including IP addresses, and they are prepared to sign a confidentiality agreement to get it.
I'm leaning towards letting them have it. Via the confidentiality agreement, we can avoid the most likely abuse scenarios, such as release of individual user profiles. Currently we let toolserver users process similar data, assisted by Wikipedia administrators who put web bugs on the site. They use it to produce the WikiCharts report. Are we to tell prospective research groups to use the toolserver, rather than their own substantial hardware, for analysis of Wikipedia traffic patterns?
I'm not sure if this would be allowed under the privacy policy, which does mention statistics but doesn't say who is making them. Maybe the use of web bugs by administrators is already against the privacy policy. In any case, I think the question would benefit from community discussion, which is why I am posting it here.
-- Tim Starling
On 14/09/2007, Tim Starling tstarling@wikimedia.org wrote:
In any case, I think the question would benefit from community discussion, which is why I am posting it here.
It might be helpful (to prevent uninformed ramblings) if we could have a draft of the proposed confidentiality agreement, or at least a rough bulletpoint of what it would cover. Unless that's confidential ;-)
I assume the data processing and handling would be done in Spain? It's certainly much less of a legal headache to shift the data to Europe rather than from Europe...
Andrew Gray wrote:
It might be helpful (to prevent uninformed ramblings) if we could have a draft of the proposed confidentiality agreement, or at least a rough bulletpoint of what it would cover. Unless that's confidential ;-)
I assume the data processing and handling would be done in Spain? It's certainly much less of a legal headache to shift the data to Europe rather than from Europe...
I'm with Andrew here; it depends on the terms and the scope of access within their community.
I'm also interested in what their research goal is. Is it technical or sociological study?
And - to try and not be uninformed - do we need to have a data disclosure policy? The edit history is a very valuable research dataset, should we have a "for academic research, under an appropriate non-disclosure agreement" as part of the privacy policy?
Brian McNeil
Andrew Gray wrote:
On 14/09/2007, Tim Starling tstarling@wikimedia.org wrote:
In any case, I think the question would benefit from community discussion, which is why I am posting it here.
It might be helpful (to prevent uninformed ramblings) if we could have a draft of the proposed confidentiality agreement, or at least a rough bulletpoint of what it would cover. Unless that's confidential ;-)
It hasn't been written yet.
I assume the data processing and handling would be done in Spain? It's certainly much less of a legal headache to shift the data to Europe rather than from Europe...
Yes.
-- Tim Starling
On 9/14/07, Tim Starling tstarling@wikimedia.org wrote:
Andrew Gray wrote:
On 14/09/2007, Tim Starling tstarling@wikimedia.org wrote:
In any case, I think the question would benefit from community discussion, which is why I am posting it here.
It might be helpful (to prevent uninformed ramblings) if we could have a draft of the proposed confidentiality agreement, or at least a rough bulletpoint of what it would cover. Unless that's confidential ;-)
It hasn't been written yet.
I assume the data processing and handling would be done in Spain? It's certainly much less of a legal headache to shift the data to Europe rather than from Europe...
Yes.
-- Tim Starling
foundation-l mailing list foundation-l@lists.wikimedia.org http://lists.wikimedia.org/mailman/listinfo/foundation-l
I'd be all for that. Helping out academic enlightenment is one of the goals of, well, every wikimedia project i know, this is simply another aspect of that. But, really, i've never been one to really care about my own privacy, i don't much care what people know about me and whether or not they like it. There are plenty out there who do, however. Making a prelim of the confidentiality agreement and publishing it might help them. It's already got my support though.
On 9/14/07, Tim Starling tstarling@wikimedia.org wrote:
For a while now, we've been releasing squid log data,
Is there a public url for accessing that data?
Mathias
And just two questions: do they need the actual IP address, or would a distinct number that merely distinguishes different IP addresses be sufficient?
When you say stripped of personally identifying information, does this include information such as search queries to our site that might, to a certain degree, be used to identify persons? People digging into the AOL data did not need IP addresses to identify individual people.
Mathias Schindler wrote:
On 9/14/07, Tim Starling tstarling@wikimedia.org wrote:
For a while now, we've been releasing squid log data,
Is there a public url for accessing that data?
What data? You mean information about the project? The data itself is only available as a UDP stream, there's no URL.
And just two questions: do they need the actual IP address, or would a distinct number that merely distinguishes different IP addresses be sufficient?
I wouldn't recommend using a hashed IP address to anyone involved in academic work. I've worked in the academic sector, I know how important it is for data to be above any criticism. Any data using unique IP addresses as an estimate of individual user population would be severely skewed by proxies and NAT.
When you say stripped of personally identifying information, does this include information such as search queries to our site that might, to a certain degree, be used to identify persons? People digging into the AOL data did not need IP addresses to identify individual people.
Yes it includes search queries, user page queries, etc., but they're all mixed in together in a homogeneous stream. There is no referrer data or user agent data. So there is no way to correlate requests.
Also, we are only sending them 1 in every 10 requests. You can't tell much about a person from one tenth of their requests, uniformly mixed in with requests from 100 million other people.
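The 1-in-10 sampling described above can be sketched as a uniform random filter. This is a hypothetical helper for illustration, not the actual squid/UDP pipeline:

```python
import random

def sample_requests(requests, rate=10):
    """Keep roughly 1 in `rate` requests, chosen uniformly at random,
    so no individual's traffic is systematically over-represented."""
    return [r for r in requests if random.randrange(rate) == 0]

random.seed(42)  # deterministic for the example
sampled = sample_requests(list(range(100_000)))
```

Because each request is dropped independently, any one person's requests end up thinly and evenly diluted in the output stream, which is the property Tim is relying on.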
-- Tim Starling
On 9/14/07, Tim Starling tstarling@wikimedia.org wrote:
I wouldn't recommend using a hashed IP address to anyone involved in academic work. I've worked in the academic sector, I know how important it is for data to be above any criticism. Any data using unique IP addresses as an estimate of individual user population would be severely skewed by proxies and NAT.
Perhaps in order to prevent potentially violating our own privacy policy, we can meet the researchers half-way. If we can find out the reason they need IP addresses we can craft the data we send them to satisfy their request. For example:
a) they could just need the unique addresses to link together browsing patterns, but not care for them to be IP addresses. We could convert the addresses into a unique number (or a salted hash) and send them the data.
b) they could be looking for network topology information; we could give them the first two or three octets of the IP address.
c) they could be looking for geographical distribution of queries; we could do the geo-lookup of addresses and give them coordinate resolution for each address instead of the address itself.
Obviously, a, b and c are all still somewhat contentious, but probably less so than just giving them raw IP addresses, and could be a good compromise.
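Options (a) and (b) could be sketched as below. The salt value and function names are made up, and, as later replies in this thread note, a salted hash is pseudonymization rather than anonymization, so this is illustrative only:

```python
import hashlib

SALT = b"example-salt"  # hypothetical; a real deployment would keep this secret

def pseudonymize_ip(ip: str) -> str:
    """Option (a): replace each IP with a salted hash so browsing
    patterns can still be linked without exposing the address."""
    return hashlib.sha256(SALT + ip.encode()).hexdigest()[:16]

def truncate_ip(ip: str, keep: int = 2) -> str:
    """Option (b): keep only the first `keep` octets of an IPv4
    address, zeroing the rest."""
    parts = ip.split(".")
    return ".".join(parts[:keep] + ["0"] * (4 - keep))
```

Option (c) would require an external GeoIP database, so it is omitted here.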
-ilya
On 9/14/07, Ilya Haykinson haykinson@gmail.com wrote:
If we can find out the reason they need IP addresses we can craft the data we send them to satisfy their request. For example:
Two years ago*, when we didn't actually have the data to release, I proposed a two pronged approach, restated here:
(1) Make as much of the non-private data public as we safely can; this maximizes the public value of this data and avoids the harm of picking favorites by sharing valuable data (commercially as well as academically valuable) with only certain groups. Plus it scales much better.
(2) Offer to run reasonable aggregation scripts for those who can describe a need for access to data we protect. For example, if they wanted to analyze article views vs country of origin the script could look up the countries and only disclose that.
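Such an aggregation script might look like the sketch below, where `geolocate` stands in for whatever IP-to-country lookup would actually be used (all names here are hypothetical):

```python
from collections import Counter

def views_by_country(log_entries, geolocate):
    """Count requests per country and return only the aggregate
    totals, so raw IP addresses never leave the server."""
    counts = Counter()
    for entry in log_entries:
        counts[geolocate(entry["ip"])] += 1
    return dict(counts)
```

The point of the design is that the privacy-sensitive field exists only inside the loop; the researcher receives nothing but per-country counts.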
If the needs of a researcher can't be met by data scrubbed with a custom aggregator, then I must question the usefulness of their research: If it's not possible to convert the research data into an aggregate result which has no privacy problems then the underlying data driving their research would be unpublishable, unrepeatable, and unverifiable.
Keep in mind that well over 99% of the people potentially impacted by this aren't our "community", they aren't people who have already agreed to lose a little privacy by making public edits... they are just readers.
It is my understanding that public libraries do not generally disclose detailed use records like this for outside research. Google and the other search engines fought in court to avoid providing the US government search log data.
I'm also disappointed with the standard of care provided by some other academic Wikipedia data researchers in recent memory.
So long as there exist *reasonable alternatives* I'm having a hard time seeing the justification for this proposed disclosure.
*For some reason our own archive of this thread seem to be missing. I found a third party copy: http://www.archivum.info/wikipedia-l@wikimedia.org/2005-08/msg00049.html
On 0, Gregory Maxwell gmaxwell@gmail.com scribbled:
On 9/14/07, Ilya Haykinson haykinson@gmail.com wrote:
If we can find out the reason they need IP addresses we can craft the data we send them to satisfy their request. For example:
Two years ago*, when we didn't actually have the data to release, I proposed a two pronged approach, restated here:
(1) Make as much of the non-private data public as we safely can; this maximizes the public value of this data and avoids the harm of picking favorites by sharing valuable data (commercially as well as academically valuable) with only certain groups. Plus it scales much better.
....
In a very strong sense, we can 'safely' make no data available. I went and did a little research (shucks, now I'm feeling like ArmedBlowfish). Entirely apart from obvious attacks using this data, like [[traffic analysis]] and all the various attacks Tor and remailer systems try to protect against, just the database alone is enough to compromise identities and reveal valuable information - even if you pseudonymize and remove data, and even if you insert dummy (but statistically valid, so it doesn't wreck analyses) data.
The obvious example to prove this would be the leak of AOL search queries, but there's an even better example. It turns out that Iceland has a very large and very well known national DNA database with which is associated a large quantity of metadata concerning family trees and what not (a somewhat amusing aside - a professor of mine once described her visits to Icelander-dominated parties; apparently when Icelanders have nothing better to chat about, or nothing particular in common, they simply go over their genealogies and figure out how they are related). Eventually [[Decode Genetics]]'s database was killed out of privacy concerns (http://observer.guardian.co.uk/international/story/0,6903,1217842,00.html etc.).
This is interesting, yes, but for us the interesting thing is that efforts were made to anonymize/scrub the data before use. Keeping in mind that the techniques were more advanced than the ones I've seen suggested here, the efforts failed. Inferences could be made from the data that broke the security quite easily. I found one particularly interesting paper on the topic; I quote from the abstract:
"Results: While susceptibility varies, we find that each of the protection methods studied is deficient in their protection against re-identification. In certain instances the protection schema itself, such as singly-encrypted pseudonymization, can be leveraged to compromise privacy even further than simple de-identification permits. In order to facilitate the future development of privacy protection methods, we provide a susceptibility comparison of the methods."
"Conclusion: This work illustrates the danger of blindly adopting identity protection methods for genomic data. Future methods must account for inferences that can be leaked from the data itself and the environment into which the data is being released in order to provide guarantees of privacy. While the protection methods reviewed in this paper provide a base for future protection strategies, our analyses provide guideposts for the development of provable privacy protecting methods."
("Why Pseudonyms Don’t Anonymize: A Computational Re-identification Analysis of Genomic Data Privacy Protection Systems"; http://privacy.cs.cmu.edu/dataprivacy/projects/linkage/lidap-wp19.pdf.)
-- gwern contacts Unix Force SUR Flame analysis bank Gamma CBNRC passwd
On 9/15/07, Gwern Branwen gwern0@gmail.com wrote:
In a very strong sense, we can 'safely' make no data available.
This is a counter-productive over-statement. It is only true in the same sort of useless sense that many dramatic maxims are true in...
I would not characterize it as such had you made any effort to concretely connect the background material, interesting as it will be to those who haven't seen it, to some aspect of our actual situation.
Gregory Maxwell wrote:
On 9/15/07, Gwern Branwen gwern0@gmail.com wrote:
In a very strong sense, we can 'safely' make no data available.
This is a counter-productive over-statement. It is only true in the same sort of useless sense that many dramatic maxims are true in...
I would not characterize it as such had you made any effort to concretely connect the background material, interesting as it will be to those who haven't seen it, to some aspect of our actual situation.
Please keep the discussion civil, Greg.
-- Tim Starling
On 9/15/07, Tim Starling tstarling@wikimedia.org wrote:
Gregory Maxwell wrote:
On 9/15/07, Gwern Branwen gwern0@gmail.com wrote:
In a very strong sense, we can 'safely' make no data available.
This is a counter-productive over-statement. It is only true in the same sort of useless sense that many dramatic maxims are true in...
[snip]
Please keep the discussion civil, Greg.
Tim, Your public admonishment to maintain civility is no less a breach of civility than my disagreement and admonishment to maintain a focus on our situation rather than spooky problems elsewhere.
My apologies to Gwern if my tone was received as excessively harsh, for that was not my intention. I do think the background references would be useful to others, so thank you for that.
On 2007.09.15 13:16:18 -0400, Gregory Maxwell gmaxwell@gmail.com scribbled 19 lines:
On 9/15/07, Tim Starling tstarling@wikimedia.org wrote:
Gregory Maxwell wrote:
On 9/15/07, Gwern Branwen gwern0@gmail.com wrote:
In a very strong sense, we can 'safely' make no data available.
This is a counter-productive over-statement. It is only true in the same sort of useless sense that many dramatic maxims are true in...
[snip]
Please keep the discussion civil, Greg.
Tim, Your public admonishment to maintain civility is no less a breach of civility than my disagreement and admonishment to maintain a focus on our situation rather than spooky problems elsewhere.
My apologies to Gwern if my tone was received as excessively harsh, for that was not my intention. I do think the background references would be useful to others, so thank you for that.
No, I wasn't bothered - I've been online long enough that my think skin disappeared a long time ago. I was more bothered that anyone thought I was wrong. :)
-- gwern 310 explicit UXO Merlin card CIA-DST TDYC AFSPC DDIS basement
On 2007.10.07 11:32:16 -0700, Ray Saintonge saintonge@telus.net scribbled 9 lines:
Gwern Branwen wrote:
No, I wasn't bothered - I've been online long enough that my think skin disappeared a long time ago. I was more bothered that anyone thought I was wrong. :)
It seems to me that a "think skin" is a reasonable path between "thick" and "thin". :-)
Ec
One flame, two flame, or, Think blue, count two, etc.?
But this just goes to show that there's always a Golden Mean, eh?
-- gwern NSIRL SASR SEAL MEU/SOCPSAC SURVIAC Meade KLM AKR data-haven 20755
On 2007.09.15 01:38:00 -0400, Gregory Maxwell gmaxwell@gmail.com scribbled 11 lines:
On 9/15/07, Gwern Branwen gwern0@gmail.com wrote:
In a very strong sense, we can 'safely' make no data available.
This is a counter-productive over-statement. It is only true in the same sort of useless sense that many dramatic maxims are true in...
Dramatic maxims are useful for shock value, which is what is needed here since people seem to be thinking that we can release vast amounts of data and not worry about abuses at all. This attitude shocks me a little, since almost by definition this subject involves releasing even more data than usual, and we've already seen abuses of public data. Not to mention that you *can't* trust researchers to keep it confidential, any more than you could anyone else. (Remember the AOL thing? It was one of their researchers who released it.)
Every bit of data reduces privacy and anonymity; this is a fact of life akin to one-time pads being unbreakable, or lossless compression being unable to compress some strings, or collisions for hashes shorter than the input...
I would not characterize it as such had you made any effort to concretely connect the background material, interesting as it will be to those who haven't seen it, to some aspect of our actual situation.
I assume everyone here is intelligent and doesn't need to have things spelled out in excruciating detail. For example, when I cite a specific Freedom House report, I assume I don't need to link the specific PDF - everyone here knows how to use Google because they've successfully subscribed to this list and are reading it.
When I cite a research paper showing that database inference attacks are powerful enough to defeat pseudonymizing and many other schemes, I don't think I should need to specifically say something rude and blunt; perhaps along the lines of "Oh, and everyone on the list who has suggested that we could just pseudonymize everything or only release parts of IP address - they're all incredibly naive fools with no appreciation for just how hard security is and how much information could be extracted from deceptively little data, and they really should just shut up and go read _Applied Cryptography_, or a bunch of Cryptogram backissues* and never again pontificate on security issues involving real people until they do."
The question here is not whether we can mangle the data so there is no danger of privacy violations. It exists, it will always exist. The question is, can we reduce that danger to below the average every-day risks of using the Internet such that our users won't have any reason to say that our privacy policy is a pack of lies and that the WMF has stabbed them in the back.
Right now, I'm not convinced it's worth it. Has anyone even said what the researchers want it for?
-- gwern SecDef AKR FLAME GEODSS on Blackmednet EODN keebler mines ^X
*or anything, really! I happen to like Bruce Schneier's writings, but there's a lot of security literature that would make the same point.
On 9/16/07, Gwern Branwen gwern0@gmail.com wrote:
On 2007.09.15 01:38:00 -0400, Gregory Maxwell gmaxwell@gmail.com scribbled 11 lines:
On 9/15/07, Gwern Branwen gwern0@gmail.com wrote:
In a very strong sense, we can 'safely' make no data available.
This is a counter-productive over-statement. It is only true in the same sort of useless sense that many dramatic maxims are true in...
Dramatic maxims are useful for shock value, which is what is needed here
We probably have an unresolvable difference in value.
In my view decision making processes need 'shock value' as much as hen-houses need foxes. ...
since people seem to be thinking that we can release vast amounts of data and not worry about abuses at all. This attitude shocks me a little, since almost by definition this subject involves releasing even more data than usual, and we've already seen abuses of public data.
At the beginning of the thread the initial respondents appeared to be under the mistaken impression that we were already liberally releasing effectively identical information.
In later replies the tone has been more negative... to the point where I'm concerned that we may be at risk of discarding the baby with the bathwater.
Not to mention that you *can't* trust researchers to keep it confidential, any more than you could anyone else.
Well, more than "anyone else" perhaps. Certainly it would be better to give the data to 'researchers' than a malicious force, or to someone completely unqualified to handle private data. ... But at the same time it would be better still to minimize disclosure.
Every bit of data reduces privacy and anonymity; this is a fact of life
Technically true, but not useful.
I assume everyone here is intelligent and
Then why resort to shock statements and over-generalizations?
[snip]
The question here is not whether we can mangle the data so there is no danger of privacy violations. It exists, it will always exist. The question is, can we reduce that danger to below the average every-day risks
[snip]
Right now, I'm not convinced it's worth it.
[snip]
I think you are creating a false choice here: The choice when dealing with private data isn't only between "no release at all" and "substantial risk but below the average every day risk".
Even while keeping the pedantic "Every bit of data reduces privacy and anonymity" in mind, there are many types of data extract which pose an exposure level so low that we can fairly classify it as none when speaking English rather than pedantese:
For example, no one sane is going to claim that releasing the daily viewership rates for existent articles, with some quantization, is going to cause a measurable impact to anyone's privacy or anonymity.
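The quantization mentioned above is trivial to implement; a minimal sketch, with an arbitrary step size:

```python
def quantize(count: int, step: int = 100) -> int:
    """Round a daily view count down to the nearest multiple of
    `step`, blurring small differences that might otherwise leak
    information about individual visits."""
    return (count // step) * step
```

For high-traffic articles the rounding error is negligible for research purposes, while very small counts (the privacy-sensitive end of the scale) collapse to zero.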
On Sat, 15 Sep 2007, Gregory Maxwell wrote:
Even while keeping the pedantic "Every bit of data reduces privacy and anonymity" in mind, there are many types of data extract which pose an exposure level so low that we can fairly classify it as none when speaking English rather than pedantese:
For example, no one sane is going to claim that releasing the daily viewership rates for existent articles, with some quantization, is going to cause a measurable impact to anyone's privacy or anonymity.
Fair enough. Also, I would find it useful to have a "please share my data with others" option for those who would like data related to their use of the site to be available to researchers and others. There are some interesting questions which can be answered approximately by looking at full data from a dozen non-randomly selected active editors, and I imagine that there are at least that many regulars who wouldn't mind sharing their usage patterns.
[for instance, the kind of daily and monthly edit cycles that you used to be able to see from the edit-count tool for everyone, and that can still be gathered from publicly available data...]
SJ
On 9/14/07, Ilya Haykinson haykinson@gmail.com wrote:
On 9/14/07, Tim Starling tstarling@wikimedia.org wrote:
I wouldn't recommend using a hashed IP address to anyone involved in academic work. I've worked in the academic sector, I know how important it is for data to be above any criticism. Any data using unique IP addresses as an estimate of individual user population would be severely skewed by proxies and NAT.
Perhaps in order to prevent potentially violating our own privacy policy, we can meet the researchers half-way.
The best way to avoid violating the privacy policy would be to change it to say exactly what it is you plan on doing, and to not give data from before the policy is changed.
If we can find out the reason they need IP addresses we can craft the data we send them to satisfy their request. For example:
a) they could just need the unique addresses to link together browsing patterns, but not care for them to be IP addresses. We could convert the addresses into a unique number (or a salted hash) and send them the data.
In case anyone's seriously considering this, make sure you've read [[AOL search data scandal]] which should show you why it's completely useless. This is *especially* true with Wikipedia data, where the urls we access constantly reveal who we are (e.g. http://en.wikipedia.org/wiki/User_talk:Whatever).
b) they could be looking for network topology information; we could give them the first two or three octets of the IP address.
Three octets would be almost as bad as a) for the same reasons. Two octets would be better, but less useful too.
c) they could be looking for geographical distribution of queries; we could do the geo-lookup of addresses and give them coordinate resolution for each address instead of the address itself.
If that geo information is limited to country, I guess it wouldn't be too bad.
On 9/14/07, Tim Starling tstarling@wikimedia.org wrote:
For a while now, we've been releasing squid log data, stripped of personally identifying information such as IP addresses, to groups at two universities: Vrije Universiteit and the University of Minnesota. We now have a request pending from a third group, at Universidad Rey Juan Carlos in Spain. They are asking if they can have the full data stream including IP addresses, and they are prepared to sign a confidentiality agreement to get it.
How long would the log file run for, and how long would the university keep the log?
On 9/14/07, Tim Starling tstarling@wikimedia.org wrote:
For a while now, we've been releasing squid log data, stripped of personally identifying information such as IP addresses, to groups at two universities: Vrije Universiteit and the University of Minnesota. We now have a request pending from a third group, at Universidad Rey Juan Carlos in Spain. They are asking if they can have the full data stream including IP addresses, and they are prepared to sign a confidentiality agreement to get it.
I'm leaning towards letting them have it. Via the confidentiality agreement, we can avoid the most likely abuse scenarios, such as release of individual user profiles. Currently we let toolserver users process similar data, assisted by Wikipedia administrators who put web bugs on the site. They use it to produce the WikiCharts report. Are we to tell prospective research groups to use the toolserver, rather than their own substantial hardware, for analysis of Wikipedia traffic patterns?
I'm not sure if this would be allowed under the privacy policy, which does mention statistics but doesn't say who is making them. Maybe the use of web bugs by administrators is already against the privacy policy. In any case, I think the question would benefit from community discussion, which is why I am posting it here.
-- Tim Starling
I don't know if we should be letting any outside groups have the IP addresses/data we are supposed to keep private; I'm uncomfortable with that. I'd sooner we have someone here who is already trusted take requests to run queries. (I note that Greg volunteers to do this... and, for that matter, has been asking for access to do just such things in the past.)
I don't think relying on an NDA to keep things private is effective enough to meet our obligations. If we don't trust people to use proper research ethics we shouldn't give them access to anything important in the first place. But mistakes happen, leaks happen, and that you can show somewhere along the way someone signed something that said they wouldn't disclose private data doesn't take back the damage done from mishandling.
The rest of the log data, that isn't private -- I don't see why you should need to be a university group to access it. Is there somewhere to do so publicly, or at least where anyone may make a request?
-Kat
On 9/14/07, Tim Starling tstarling@wikimedia.org wrote: [snip]
They are asking if they can have the full data stream including IP addresses, and they are prepared to sign a confidentiality agreement to get it.
[snip]
Currently we let toolserver users process similar data, assisted by Wikipedia administrators who put web bugs on the site. They use it to produce the WikiCharts report. Are we to tell prospective research groups to use the toolserver, rather than their own substantial hardware, for analysis of Wikipedia traffic patterns?
[snip]
This is simply not true.
The web bug used by Wikicharts uses a URL which gets a custom log format which logs only the most basic data, here is an example entry:
[14/Sep/2007:00:09:36 +0000] "GET /xyz.png?ns=0&title=Honored%20Matres&factor=6000&wiki=enwiki HTTP/1.1"
That is the entirety of the logged data. With the exception of the HTTP version nothing is gathered which is not strictly necessary to produce the top viewed page data, and even that is gathered at a sampling rate low enough to make the usefulness questionable.
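For reference, the fields in an entry like that can be pulled apart with a few lines of Python (a sketch; the real log processing may well differ):

```python
import re
from urllib.parse import parse_qs, urlparse

ENTRY = ('[14/Sep/2007:00:09:36 +0000] '
         '"GET /xyz.png?ns=0&title=Honored%20Matres&factor=6000&wiki=enwiki HTTP/1.1"')

def parse_entry(line: str) -> dict:
    """Extract the query-string fields from a Wikicharts-style log line."""
    match = re.search(r'"GET (\S+) HTTP/[\d.]+"', line)
    query = parse_qs(urlparse(match.group(1)).query)
    return {key: values[0] for key, values in query.items()}
```

As Greg says, the only recoverable fields are the namespace, title, sampling factor, and wiki; there is no IP, referrer, or user-agent data to extract.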
Not that it isn't horribly silly that we're using a JS web-bug and the toolserver for this, because we are already recording much better data while the wikicharts approach is unreliable, low quality, and trivially subject to manipulation. At the time Wikicharts was established there was no Wikimedia logging, and because all of the Wikimedia logging data is kept private even from most of our own 'inside people', Wikicharts continues to use this method for its reporting.
The data we are providing to outsiders is substantially better than the data available to people with @wikimedia.org addresses, including myself.
For the moment I'm going to refrain from making further public comment on this subject because I've not yet read most of the messages and I think consideration is deserved before issuing some harsh criticism. ... but the comment about wikicharts logging is a factual matter which demanded correction.
Tim Starling wrote:
For a while now, we've been releasing squid log data, stripped of personally identifying information such as IP addresses, to groups at two universities: Vrije Universiteit and the University of Minnesota. We now have a request pending from a third group, at Universidad Rey Juan Carlos in Spain. They are asking if they can have the full data stream including IP addresses, and they are prepared to sign a confidentiality agreement to get it.
Why do they need the ips? What is the purpose of the data?
I don't see why personally identifying information could be needed other than to personally identify someone. Given that you say ip's are not to be used as unique ids... Maybe they're going to proxyscan hundreds of ips to find out if they're proxies??
I'd like to see the request reasons :P
PS: The intercepted data would surely be useless but would the data stream with "personally identifying information" be vulnerable to a man-in-the-middle attack?
On 14/09/2007 13:41, Tim Starling wrote: [..snip..]
I'm not sure if this would be allowed under the privacy policy, which does mention statistics but doesn't say who is making them. Maybe the use of web bugs by administrators is already against the privacy policy. In any case, I think the question would benefit from community discussion, which is why I am posting it here.
From http://wikimediafoundation.org/wiki/Privacy_policy#Private_logging: there are six points, mostly about law enforcement and protecting the projects against abuse, followed by "Wikimedia policy does not permit public distribution of such information under any circumstances, except as described above". There is no hint of academic research or NDAs that would allow third parties to access that personal information, not without *explicit* consent from the users.
We have strict rules about how CUs have to handle private data of _editors_ and then we would allow three universities to access data of _any_user_ that access WMF sites? I consider myself mildly paranoid, so this is undoubtedly POV, but I think this idea is crazy.
Some people use static IP addresses, even with personal information attached to whois records; have you ever considered that?
Brownout
ICQ IM: 236537882 MSN IM: brown dot out at hotmail dot com OpenPGP key: 0xCB11EA7E fingerprint = 6706 B72E 0500 EC52 B33D 13B6 FCFA 8BE5 CB11 EA7E
On 9/14/07, Brownout brovvnout@gmail.com wrote:
We have strict rules about how CUs have to handle private data of _editors_ and then we would allow three universities to access data of _any_user_ that access WMF sites? I consider myself mildly paranoid, so this is undoubtedly POV, but I think this idea is crazy.
Some people use static IP addresses, even with personal information attached to whois records; have you ever considered that?
It's a tremendous bit of information. For those people whose identities are in their WP profile, you'd be giving access to everything they ever read. For those people whose identities aren't in their WP profile, you'd be giving location information which might very well be enough to identify them.
What I still don't understand is what period this information would be from. Would it only be a UDP stream of new requests, or would it include old log data? At least if it's only new requests those of us who are "mildly paranoid" can make sure we always access WP through tor.
Anthony wrote:
What I still don't understand is what period this information would be from. Would it only be a UDP stream of new requests, or would it include old log data? At least if it's only new requests those of us who are "mildly paranoid" can make sure we always access WP through tor.
I've been passing on questions from this thread to the researchers, and I'm still waiting for their reply. So I won't answer most of the questions just yet, but I can answer this one.
It's a UDP stream of new requests, they won't get any old data.
There is no old log data for them to have, except at 1/1000 sampling, and even that has gaps in it due to disk-full conditions, and it doesn't go back very far.
-- Tim Starling
Tim Starling wrote:
Anthony wrote:
What I still don't understand is what period this information would be from. Would it only be a UDP stream of new requests, or would it include old log data? At least if it's only new requests those of us who are "mildly paranoid" can make sure we always access WP through tor.
I've been passing on questions from this thread to the researchers, and I'm still waiting for their reply. So I won't answer most of the questions just yet, but I can answer this one.
It's a UDP stream of new requests, they won't get any old data.
There is no old log data for them to have, except at 1/1000 sampling, and even that has gaps in it due to disk-full conditions, and it doesn't go back very far.
A stream of requests of live Wikipedia information is valuable as hell. Imagine Google Zeitgeist, but instead of just getting a once-a-year snapshot of it, you get it every second.
I think this information is too powerful and valuable to pick and choose who gets it. Either find a way to release it with a public API so that anyone can access it (and there are many excellent uses for this kind of data), or don't release it at all. I especially don't like the idea of certain people getting raw data that easily identifies users. It's a total sidestep of the check user policy. I know this isn't what most editors signed up for.
No comment on your other points, but I'd hardly consider a university conducting research under an NDA 'public' disclosure at all, to the point where I don't think this even breaks the privacy policy, although whether or not to do it should be a community decision.
On 9/14/07, Brownout brovvnout@gmail.com wrote:
On 14/09/2007 13:41, Tim Starling wrote: [..snip..]
I'm not sure if this would be allowed on the privacy policy, which does mention statistics, but doesn't say who is making them. Maybe the use of web bugs by administrators is already against the privacy policy. In any case, I think the question would benefit from community discussion, which is why I am posting it here.
There are six points, mostly about law enforcement and protecting the projects against abuse, followed by "Wikimedia policy does not permit public distribution of such information under any circumstances, except as described above". There is no hint of academic research or NDAs that would allow third parties to access that personal information, not without *explicit* consent from the users.
We have strict rules about how CUs have to handle private data of _editors_ and then we would allow three universities to access data of _any_user_ that access WMF sites? I consider myself mildly paranoid, so this is undoubtedly POV, but I think this idea is crazy.
Some people use static IP addresses, even with personal information attached to whois records; have you ever considered that?
Brownout
foundation-l mailing list foundation-l@lists.wikimedia.org http://lists.wikimedia.org/mailman/listinfo/foundation-l
Brock Weller wrote:
No comment on your other points, but I'd hardly consider a university conducting research under an NDA 'public' disclosure at all, to the point where I don't think this even breaks the privacy policy, although whether or not to do it should be a community decision.
Even if it's not a violation of the privacy policy per se, IMO the privacy policy should still be updated to reflect this use of the data if it's something we agree is a legitimate use of it and are going to be doing on a semi-regular basis. We already list a few things the data is used for, so we can just add another one to the list.
-Mark
On 9/14/07, Tim Starling tstarling@wikimedia.org wrote:
For a while now, we've been releasing squid log data, stripped of personally identifying information such as IP addresses, to groups at two universities: Vrije Universiteit and the University of Minnesota. We now have a request pending from a third group, at Universidad Rey Juan Carlos in Spain. They are asking if they can have the full data stream including IP addresses, and they are prepared to sign a confidentiality agreement to get it.
"Wikimedia will not sell or share private information, such as email addresses, with third parties, unless you agree to release this information, or it is required by law to release the information." http://wikimediafoundation.org/wiki/Privacy_policy
Under the current policy I would not support it, even if "private information" is somewhat ambiguous: we must err on the side of caution.
I might support a research exemption clause in future versions of the policy _if_ a compelling case can be made that such an exemption is needed, and that no alternative research method would produce results of approximately the same quality. So far no such case has been made.
Whatever we do, it is crucial that we make it clear to our users through our privacy policy what is going on. In that spirit, I would also appreciate it if the privacy policy could be updated to describe the existing agreements with universities, and the work that is being done on the toolserver.
Wikiresearch-l had a roundtable about this at Wikimania two years ago. We reached no conclusion. I would love to pipe this data through my quality classifier, especially combined with the edit histories of the associated users. But do you realize what kind of a double whammy that is? Not only do you have their surfing habits, you've got their editing habits. On one of the largest websites in the world. This data is of unspeakable value not only to researchers, but to spammers, would-be identity thieves and others.
Although having this data is a wet dream of mine, I find it unconscionable to release it, and I feel that whoever was responsible for releasing it has already overstepped their bounds. We already know from the New York Times analyzing AOL's search logs that persons can be identified from search logs, and we know from Microsoft's Non-Disclosure Agreements with universities around the world for portions of the Windows 2000 source code that these NDAs, even to universities, are not effective in stopping the data from being leaked.
Now that the data has already been released, it is imperative that the foundation create an explicit philosophy about data retention policies and the circumstances under which user data may be released. I suggest that it never be released, and that the foundation hire and/or appoint a statistician for analyzing logs in-house. Perhaps this person can act as a liaison in certain, well-defined situations that do not compromise the personal information of anyone beyond what is already available in database dumps. This is the only ethical approach in my opinion.
On 9/15/07, Erik Moeller erik@wikimedia.org wrote:
On 9/14/07, Tim Starling tstarling@wikimedia.org wrote:
For a while now, we've been releasing squid log data, stripped of personally identifying information such as IP addresses, to groups at two universities: Vrije Universiteit and the University of Minnesota. We now have a request pending from a third group, at Universidad Rey Juan Carlos in Spain. They are asking if they can have the full data stream including IP addresses, and they are prepared to sign a confidentiality agreement to get it.
"Wikimedia will not sell or share private information, such as email addresses, with third parties, unless you agree to release this information, or it is required by law to release the information." http://wikimediafoundation.org/wiki/Privacy_policy
Under the current policy I would not support it, even if "private information" is somewhat ambiguous: we must err on the side of caution.
I might support a research exemption clause in future versions of the policy _if_ a compelling case can be made that such an exemption is needed, and that no alternative research method would produce results of approximately the same quality. So far no such case has been made.
Whatever we do, it is crucial that we make it clear to our users through our privacy policy what is going on. In that spirit, I would also appreciate it if the privacy policy could be updated to describe the existing agreements with universities, and the work that is being done on the toolserver. -- Toward Peace, Love & Progress: Erik
DISCLAIMER: This message does not represent an official position of the Wikimedia Foundation or its Board of Trustees.
On 9/15/07, Brian Brian.Mingus@colorado.edu wrote:
Although having this data is a wet dream of mine, I find it unconscionable to release it, and I feel that whoever was responsible for releasing it has already overstepped their bounds.
Unless I'm misunderstanding the rest of the thread, no one has as yet released data containing personally identifiable information (the statement about toolserver and wikicharts in the original post was in error).
-Kat
The only ethical choice? Are you employing hyperbole, or are you serious? Let's set aside the IP request for now and just look at the two releases made so far, without personal info. You're seriously suggesting that someone inferring from edits that person X, without knowing the identity of that person, likes to edit about kangaroos and trees, but not tree kangaroos, used to create a deeper, better understanding of mankind and the way we function, is the unethical choice? Like I said, leaving out this new request and just going with the two previous releases, it would seem almost criminal to not make the non-personal data available for research.
On 9/15/07, Brian Brian.Mingus@colorado.edu wrote:
Wikiresearch-l had a roundtable about this at Wikimania two years ago. We reached no conclusion. I would love to pipe this data through my quality classifier, especially combined with the edit histories of the associated users. But do you realize what kind of a double whammy that is? Not only do you have their surfing habits, you've got their editing habits. On one of the largest websites in the world. This data is of unspeakable value not only to researchers, but to spammers, would-be identity thieves and others.
Although having this data is a wet dream of mine, I find it unconscionable to release it, and I feel that whoever was responsible for releasing it has already overstepped their bounds. We already know from the New York Times analyzing AOL's search logs that persons can be identified from search logs, and we know from Microsoft's Non-Disclosure Agreements with universities around the world for portions of the Windows 2000 source code that these NDAs, even to universities, are not effective in stopping the data from being leaked.
Now that the data has already been released, it is imperative that the foundation create an explicit philosophy about data retention policies and the circumstances under which user data may be released. I suggest that it never be released, and that the foundation hire and/or appoint a statistician for analyzing logs in-house. Perhaps this person can act as a liaison in certain, well-defined situations that do not compromise the personal information of anyone beyond what is already available in database dumps. This is the only ethical approach in my opinion.
On 9/15/07, Erik Moeller erik@wikimedia.org wrote:
On 9/14/07, Tim Starling tstarling@wikimedia.org wrote:
For a while now, we've been releasing squid log data, stripped of personally identifying information such as IP addresses, to groups at two universities: Vrije Universiteit and the University of Minnesota. We now have a request pending from a third group, at Universidad Rey Juan Carlos in Spain. They are asking if they can have the full data stream including IP addresses, and they are prepared to sign a confidentiality agreement to get it.
"Wikimedia will not sell or share private information, such as email addresses, with third parties, unless you agree to release this information, or it is required by law to release the information." http://wikimediafoundation.org/wiki/Privacy_policy
Under the current policy I would not support it, even if "private information" is somewhat ambiguous: we must err on the side of caution.
I might support a research exemption clause in future versions of the policy _if_ a compelling case can be made that such an exemption is needed, and that no alternative research method would produce results of approximately the same quality. So far no such case has been made.
Whatever we do, it is crucial that we make it clear to our users through our privacy policy what is going on. In that spirit, I would also appreciate it if the privacy policy could be updated to describe the existing agreements with universities, and the work that is being done on the toolserver. -- Toward Peace, Love & Progress: Erik
Brian wrote:
Although having this data is a wet dream of mine, I find it unconscionable to release it, and I feel that whoever was responsible for releasing it has already overstepped their bounds. We already know from the New York Times analyzing AOL's search logs that persons can be identified from search logs, and we know from Microsoft's Non-Disclosure Agreements with universities around the world for portions of the Windows 2000 source code that these NDAs, even to universities, are not effective in stopping the data from being leaked.
The data that has been released cannot be used to identify individuals. The AOL search data could be used to identify individuals, because searches were tagged with a pseudonymous identifier. There are no such identifiers in the data we are sending out.
For example, a search for a social security number, by itself, tells you nothing about the individual who made it. Was it the owner of the SSN, an employer, or someone going through the owner's rubbish? Or was it a Wikipedian trying to determine if someone's SSN is notable enough to include in an article?
In the unlikely event that someone types their life story into the search box and clicks "go", you still don't know who wrote it, whether it was autobiographical, slander or fantasy.
If you see the pattern of a person's requests to Wikipedia, then you can infer something about them. But you can't do that with the data we are sending.
And finally, note that we are not releasing this data publicly, nor am I suggesting that we should. We are not sending it to anyone who wants it. We are sending it to three research groups at respectable universities.
I can imagine a research group being tempted to republish a code snippet from Windows 2000. I find it hard to imagine that a research group would be tempted to mine 100 billion log lines for some tiny fragment of private data, and then release that data publicly or sell it to spammers.
-- Tim Starling
Tim Starling wrote:
Brian wrote:
Although having this data is a wet dream of mine, I find it unconscionable to release it, and I feel that whoever was responsible for releasing it has already overstepped their bounds. We already know from the New York Times analyzing AOL's search logs that persons can be identified from search logs, and we know from Microsoft's Non-Disclosure Agreements with universities around the world for portions of the Windows 2000 source code that these NDAs, even to universities, are not effective in stopping the data from being leaked.
The data that has been released cannot be used to identify individuals. The AOL search data could be used to identify individuals, because searches were tagged with a pseudonymous identifier. There are no such identifiers in the data we are sending out.
I'm going to assume good faith here and just assume that you simply don't know what the AOL search data was about. The AOL search data was NOT tagged with pseudonymous data (by which I'm assuming you mean usernames). It was tagged with random numbers. The way privacy was compromised in the AOL search data scandal had nothing to do with what the data was labeled as and everything to do with what the data was. One could look at all of the searches made by a given person and clue in on who they were - e.g. by looking for local subjects in their searches, see if they searched for anyone by name (maybe themselves or people they knew), see if they searched for any esoteric subjects, etc.
You would do well to educate yourself on what the AOL search data scandal actually was, because it seems like we may already be making the same mistakes without you realizing it.
For example, a search for a social security number, by itself, tells you nothing about the individual who made it. Was it the owner of the SSN, an employer, or someone going through the man's rubbish? Or was it a Wikipedian trying to determine if someone's SSN is notable enough to include in an article?
Actually, a search for a social security number tells you pretty much everything you need to know and leads directly to infringement of privacy. Many people unmasked in the AOL search data scandal had been searching for personally identifiable information.
In the unlikely event that someone types their life story into the search box and clicks "go", you still don't know who wrote it, whether it was autobiographical, slander or fantasy.
You're being unrealistic here. You're assuming that the person doing the investigating is a complete moron and isn't able to put one and one together. That simply isn't true. In the AOL search data scandal, reporters were able to discover many real life identities using information that was far, far less substantial than a complete life story. Something as simple as a few keyword searches for obscure hobbies and location-specific searches was enough to track some people down. After all, how many Yorkshire Terrier enthusiasts do you think you're going to find in average Small Town, USA?
On 9/15/07, Ben McIlwain cydeweys@gmail.com wrote: [snip]
The AOL search data was NOT tagged with pseudonymous data (by which I'm assuming you mean usernames). It was tagged with random numbers. The way privacy was compromised in the AOL search data scandal had nothing to do with what the data was labeled as and everything to do with what the data was. One could look at all of the searches made by a given person and clue in on who they were - e.g. by looking for local subjects in their searches, see if they searched for anyone by name (maybe themselves or people they knew), see if they searched for any esoteric subjects, etc.
A unique random ID is a pseudonym. The ability to tie multiple searches to the same pseudonym was key; while I could guess the probable identity of a single search in some cases without any pseudonym, it is, as you pointed out, the ability to tie them together which creates trouble.
The point Tim was making was that the data Wikimedia has *previously released* did not include any sort of identifier, pseudonymous or not, and thus doesn't have the same risks.
The data which is *proposed* to be disclosed would include IPs, which act as either a pseudonymous identifier or an outright identifier. I doubt Tim would disagree that there are significant privacy implications in that case. Which is, of course, why he said they were willing to enter into an NDA.
Gregory Maxwell wrote:
On 9/15/07, Ben McIlwain cydeweys@gmail.com wrote: [snip]
The AOL search data was NOT tagged with pseudonymous data (by which I'm assuming you mean usernames). It was tagged with random numbers. The way privacy was compromised in the AOL search data scandal had nothing to do with what the data was labeled as and everything to do with what the data was. One could look at all of the searches made by a given person and clue in on who they were - e.g. by looking for local subjects in their searches, see if they searched for anyone by name (maybe themselves or people they knew), see if they searched for any esoteric subjects, etc.
A unique random ID is a pseudonym. The ability to tie multiple searches to the same pseudonym was key; while I could guess the probable identity of a single search in some cases without any pseudonym, it is, as you pointed out, the ability to tie them together which creates trouble.
What he said.
-- Tim Starling
Erik Moeller wrote:
On 9/14/07, Tim Starling tstarling@wikimedia.org wrote:
For a while now, we've been releasing squid log data, stripped of personally identifying information such as IP addresses, to groups at two universities: Vrije Universiteit and the University of Minnesota. We now have a request pending from a third group, at Universidad Rey Juan Carlos in Spain. They are asking if they can have the full data stream including IP addresses, and they are prepared to sign a confidentiality agreement to get it.
"Wikimedia will not sell or share private information, such as email addresses, with third parties, unless you agree to release this information, or it is required by law to release the information." http://wikimediafoundation.org/wiki/Privacy_policy
Under the current policy I would not support it, even if "private information" is somewhat ambiguous: we must err on the side of caution.
Yes. The first question is, would providing this data violate the privacy policy, which protects "private information" - often but not always assumed to mean personally-identifiable information. If we consider the squid log data to include potentially personally-identifiable/private information, then we can't release it to a third party. Regardless of how much we trust them, or what they are willing to sign.
If the release does NOT violate the privacy policy, then the question becomes whether it violates existing community standards & practices. I don't know the answer to that. But there has been lots of discussion here, which may suggest there's not a clear consensus view.
IMO we want to help academics, and we share lots of their values, but it is more important that we protect our own community of users/contributors. So we want to err on that side.
I might support a research exemption clause in future versions of the policy _if_ a compelling case can be made that such an exemption is needed, and that no alternative research method would produce results of approximately the same quality. So far no such case has been made.
Yes. Regardless, that would apply on a going-forward basis only; we obviously could not change the terms of use retroactively/non-consensually.
Whatever we do, it is crucial that we make it clear to our users through our privacy policy what is going on. In that spirit, I would also appreciate it if the privacy policy could be updated to describe the existing agreements with universities, and the work that is being done on the toolserver.
Sue Gardner wrote:
Erik Moeller wrote:
On 9/14/07, Tim Starling tstarling@wikimedia.org wrote:
For a while now, we've been releasing squid log data, stripped of personally identifying information such as IP addresses, to groups at two universities: Vrije Universiteit and the University of Minnesota. We now have a request pending from a third group, at Universidad Rey Juan Carlos in Spain. They are asking if they can have the full data stream including IP addresses, and they are prepared to sign a confidentiality agreement to get it.
"Wikimedia will not sell or share private information, such as email addresses, with third parties, unless you agree to release this information, or it is required by law to release the information." http://wikimediafoundation.org/wiki/Privacy_policy
Under the current policy I would not support it, even if "private information" is somewhat ambiguous: we must err on the side of caution.
Yes. The first question is, would providing this data violate the privacy policy, which protects "private information" - often but not always assumed to mean personally-identifiable information. If we consider the squid log data to include potentially personally-identifiable/private information, then we can't release it to a third party. Regardless of how much we trust them, or what they are willing to sign.
Trust and signatures are not enough. How will they react if a government demands the release of private information? If we determine that we will not release it in the absence of a court order, what recourse do we have if the acquirers are not willing to resist a government order in the courts? In some jurisdictions there may be no such right to challenge such an order.
If the release does NOT violate the privacy policy, then the question becomes whether it violates existing community standards & practices. I don't know the answer to that. But there has been lots of discussion here, which may suggest there's not a clear consensus view.
There's at least a consensus insofar as appreciating that there are a lot of concerns about this issue that cannot be easily resolved. The simple fact that Tim sought the advice of this list before barging ahead tells us that even the proponent of these acts has his doubts.
Ec
On 15/09/2007, Ray Saintonge saintonge@telus.net wrote:
Trust and signatures are not enough. How will they react if a government demands the release of private information? If we determine that we will not release it in the absence of a court order, what recourse do we have if the acquirers are not willing to resist a government order in the courts? In some jurisdictions there may be no such right to challenge such an order.
It's going to Spain. The data protection laws (and culture) there are more stringent than those in the US. This sort of handwaving is a little misleading... it's at just as much risk of a government demand, with substantially lower legal protection or right to refuse, on *our* servers!
(The same argument applies to whoever said "and if an Iranian university asked for it?"... that would be a very different question, and we would be quite within our rights to say no if we felt the information would not be appropriately safeguarded from misuse either by the recipient or their government)
The repeated mention of a non-disclosure agreement indicates to me that most everyone involved in this conversation knows, at the most basic level, that we're talking about releasing private data.
Have our readers agreed to release this information? Are they aware that their private browsing habits will be subject to third-party review? Why don't we announce this in the sitenotice, and see what our readers think about it?
The privacy policy is quite explicit: "Wikimedia will not sell or share private information, such as email addresses, with third parties, unless you agree to release this information, or it is required by law to release the information." There is no exception for "but it's really cool," nor "they asked nicely," nor even "they said it's totally still private, signed a contract and everything."
If we alter the privacy policy every time we feel like releasing (or selling) information, we might as well not have a privacy policy at all.
Yes, research is important. Yes, our goal is to spread and increase the sum of human knowledge. But privacy of private data is currently written into policy as being *important*, and I haven't yet seen a compelling reason to change that.
This should not be a casual decision. The information security of our editors and readers should be an utmost priority.
-Luna
On 9/19/07, Luna lunasantin@gmail.com wrote:
Yes, research is important. Yes, our goal is to spread and increase the sum of human knowledge. But privacy of private data is currently written into policy as being *important*, and I haven't yet seen a compelling reason to change that.
This should not be a casual decision. The information security of our editors and readers should be an utmost priority.
Information security is particularly important given that cyberstalking has become an increasing problem on Wikipedia. We're currently hearing about several new cases a month, and some of them have been quite serious, with editors (usually admins) being contacted at their homes, family members threatened with violence, threats to contact employers, and so on.
My understanding is that, with the information people are considering releasing, it would be possible for someone to work out which editor had which IP address, which would be a serious betrayal of trust.
Sarah
On 9/19/07, SlimVirgin slimvirgin@gmail.com wrote:
My understanding is that, with the information people are considering releasing, it would be possible for someone to work out which editor had which IP address, which would be a serious betrayal of trust.
Right; any relatively unique edit (to a given article, without many temporally close-by edits) could be traced from the HTTP operations to the article edit logs and ID the user involved. Repeat for all the users who edit in a given time period... odds are high that this could be used to effectively mass-checkuser the whole site. Given a database dump and the HTTP data stream, one could write a tool to automatically resolve everything pretty easily.
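A minimal sketch of that correlation attack, with entirely made-up requester IDs and timestamps (nothing below comes from real squid data): join a stream of anonymized (requester_id, timestamp) POST records against the public (username, timestamp) edit history, and any temporally isolated edit links the requester ID to the account.

```python
# Sketch (hypothetical data): correlate an anonymized request log with
# the public edit log. A temporally unique edit unambiguously links an
# anonymous requester ID to a username.
from collections import defaultdict

http_log = [            # hypothetical anonymized squid records
    ("req-7f3a", 1190300005),
    ("req-19cc", 1190300212),
    ("req-7f3a", 1190304410),
]
edit_log = [            # public per-article edit history
    ("Alice", 1190300005),
    ("Bob", 1190300212),
    ("Alice", 1190304411),
]

links = defaultdict(set)
for req_id, t_req in http_log:
    # Allow a small clock skew between proxy and database timestamps.
    matches = [u for u, t in edit_log if abs(t - t_req) <= 2]
    if len(matches) == 1:           # temporally unique: unambiguous link
        links[req_id].add(matches[0])

print(dict(links))  # {'req-7f3a': {'Alice'}, 'req-19cc': {'Bob'}}
```

Repeated over a full dump and data stream, each requester ID only needs one such isolated edit to be identified, which is the "mass-checkuser" risk described above.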
While I am generally all for editors being more open about their identities, giving anyone the power to do this is a problem, in my opinion. We restrict this level of data access internally rather strictly; allowing it out in the open, to independent researchers, is potentially very problematic. It worries me, and if it worries me, it will certainly worry those who are more concerned with preserving pseudonymity and privacy. They would likely feel that this is a breach of the explicit or implied privacy policy, and I would tend to agree with them.
Even replacing IPs with unique hashes or other IDs would allow leakage of info; one could extend the theoretical tool above, to find all temporally relatively unique edits by a given unique ID and look in the database dump for any that were done by someone not logged in.
Also vulnerable to a brute force attack. There are only 2^32 possible IPs; that's about 4 billion. Excluding the rather brutally obvious complete IP -> hash lookup table method, it would take only about an hour to search the whole space if your CPU can do a million hashes a second (2,000 usable ops per hash or so). Anyone performing a widespread search would undoubtedly build the table; it's going to be small (64-bit hash -> 32 GB) compared to modern disks (and some people's RAM...).
If you salted each IP with a different salt, that would be effective, but would also require us to generate and store a large secure table of salts (or an IP -> salted-hash forward table). And it still doesn't get around the temporally relatively unique edits comparison method.
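The brute-force point can be sketched in a few lines. The hash scheme, the 64-bit truncation, and the addresses below are illustrative assumptions, and the demo enumerates only a /16 rather than the full 2^32 space so it runs instantly; the same loop scales to every IPv4 address on commodity hardware.

```python
# Sketch (hypothetical scheme): why unsalted hashes of IPv4 addresses
# are reversible by exhaustive search. We "anonymize" one address,
# then recover it by hashing every candidate in a small range.
import hashlib

def ip_hash(ip: str) -> str:
    # Unsalted hash, as in the scheme being criticized;
    # truncated to a 64-bit (16 hex digit) identifier.
    return hashlib.sha256(ip.encode()).hexdigest()[:16]

leaked = ip_hash("10.0.37.129")     # the "anonymized" log entry

found = None
for host in range(2**16):           # scan 10.0.0.0/16 for the demo
    b3, b4 = host >> 8, host & 0xFF
    candidate = f"10.0.{b3}.{b4}"
    if ip_hash(candidate) == leaked:
        found = candidate
        break

print(found)  # recovers the original address: 10.0.37.129
```

A per-IP salt breaks this enumeration (the attacker can no longer precompute hashes), which is exactly the trade-off discussed above: it requires storing a secure salt table and still does nothing against timestamp correlation.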
On 9/19/07, SlimVirgin slimvirgin@gmail.com wrote: [snip]
My understanding is that, with the information people are considering releasing, it would be possible for someone to work out which editor had which IP address, which would be a serious betrayal of trust.
Hopefully you can see from my prior posts on this thread that I favor a conservative handling of private data and you won't mistake my point below for an insensitivity to your concerns.
I agree that the log data must not be handled in a way that reduces privacy, but I disagree with the implied claim that there is a high level of privacy for *editors* to begin with.
If editors are betting on the privacy of their IP addresses to avoid harassment or stalkers then they are making a bad bet. I do not want people to be surprised when they discover the privacy they thought they had did not really exist.
There are many ways a user's IP can be leaked. For example, whenever you follow a link to an external site, your address is leaked to that site. Any administrator can inject CSS or JS into your personal or the site-wide files, which could cause your browser to connect to another site and give away your address. Your use of email along with your account can reveal your address. We have a great many checkusers, and while they are trustworthy, their machines or accounts could become compromised. Checkuser data is sent unencrypted to checkusers across the Internet. ... And it's very, very easy to accidentally edit while logged out, especially when you cross over to one of our other wikis like Commons or Meta.
The protections provided today are not bad. But they are not very good because very good protection would be someplace between highly inconvenient and impossible.
Only the most paranoid and inconvenience-tolerant people have a fighting chance of keeping their identity totally secret during a long editing career.
Most people simply lack the foresight (few expect stalkers the day they make their first edit), technical expertise, and patience required to strongly protect their anonymity while editing.
Providing privacy strong enough to stop a stalker for people who are indirectly spewing out large amounts of information about themselves in the form of edits is just a really hard problem which I don't have a solution for...
On 9/19/07, Gregory Maxwell gmaxwell@gmail.com wrote:
On 9/19/07, SlimVirgin slimvirgin@gmail.com wrote: [snip]
My understanding is that, with the information people are considering releasing, it would be possible for someone to work out which editor had which IP address, which would be a serious betrayal of trust.
Hopefully you can see from my prior posts on this thread that I favor a conservative handling of private data and you won't mistake my point below for an insensitivity to your concerns.
I agree that the log data must not be handled in a way that reduces privacy, but I disagree with the implied claim that there is a high level of privacy for *editors* to begin with.
If editors are betting on the privacy of their IP addresses to avoid harassment or stalkers then they are making a bad bet. I do not want people to be surprised when they discover the privacy they thought they had did not really exist.
There are many ways a user's IP can be leaked. For example, whenever you follow a link to an external site, your address is leaked to that site. Any administrator can inject CSS or JS into your personal or the site-wide files, which could cause your browser to connect to another site and give away your address. Your use of email along with your account can reveal your address. We have a great many checkusers, and while they are trustworthy, their machines or accounts could become compromised. Checkuser data is sent unencrypted to checkusers across the Internet. ... And it's very, very easy to accidentally edit while logged out, especially when you cross over to one of our other wikis like Commons or Meta.
Yes, I agree that protecting IP address is hard. Just as an example, we have one stalker (and I'm using the word advisedly) who posts links on people's talk pages to what appears to be Wikipedia articles, purportedly asking for advice, but in fact diverting that user to the stalker's own website, so he can pick up the IP. He's also sent e-mails with disguised links that divert people to a blog he has access to.
The concerns of people being harassed are partly to do with not wanting people to know where we edit from, but also to do with fears that the more determined stalkers could get into the user's computer if they knew the exact IP, which is a more serious invasion than knowing you live in New York or wherever.
The protections provided today are not bad. But they are not very good because very good protection would be someplace between highly inconvenient and impossible.
Only the most paranoid and inconvenience-tolerant people have a fighting chance of keeping their identity totally secret during a long editing career.
Most people simply lack the foresight (few expect stalkers the day they make their first edit), technical expertise, and patience required to strongly protect their anonymity while editing.
Providing privacy strong enough to stop a stalker for people who are indirectly spewing out large amounts of information about themselves in the form of edits is just a really hard problem which I don't have a solution for...
I agree with you. It's very tricky.
The only workable solution I can see is to make it less likely that stalkers will want to target particular admins. One way to do that would be to set up anonymous admin accounts that multiple admins could use. So for example, if a difficult user needs to be blocked, any admin could access the joint admin account to make the block. The user would only see that User:Admin1 had blocked him. Only trusted people would have access to which admin had made a block with User:Admin1 at time T.
I know it would complicate things, and it might make admin abuse a little more likely. And we'd still have the problem of potential leaks, so it wouldn't be foolproof by any means.
Sarah
Without asking for details, how is this accomplished? I can only see it if he emails, and one replies, or if he tells you about a web site, and you click on it.
Yes, I agree that protecting IP address is hard. Just as an example, we have one stalker (and I'm using the word advisedly) who posts links on people's talk pages to what appears to be Wikipedia articles, purportedly asking for advice, but in fact diverting that user to the stalker's own website, so he can pick up the IP. He's also sent e-mails with disguised links that divert people to a blog he has access to. --
David Goodman, Ph.D, M.L.S.
On 9/20/07, David Goodman dgoodmanny@gmail.com wrote:
Without asking for details, how is this accomplished? I can only see it if he emails, and one replies, or if he tells you about a web site, and you click on it.
Yes, it works only if you click on the link. But the links are disguised to look like something else -- a Wikipedia page, for example.
Sarah
SlimVirgin wrote:
On 9/20/07, David Goodman wrote:
Without asking for details, how is this accomplished? I can only see it if he emails, and one replies, or if he tells you about a web site, and you click on it.
Yes, it works only if you click on the link. But the links are disguised to look like something else -- a Wikipedia page, for example.
Bank swindles do a lot of that too. I have no problem with removing deceptive links that go somewhere different from where they purport to go.
To some extent people also need to practise a little safety of their own, as with not accepting all cookies.
Ec
On 20/09/2007, Ray Saintonge saintonge@telus.net wrote:
Bank swindles do a lot of that too. I have no problem with removing deceptive links that go somewhere different from where they purport to go.
To some extent people also need to practise a little safety of their own, as with not accepting all cookies.
I'm all for safe web surfing, but that doesn't include not clicking on links. Not entering your bank account details afterwards, sure, but clicking on the link is usually pretty safe.
On 9/20/07, Thomas Dalton thomas.dalton@gmail.com wrote:
I'm all for safe web surfing, but that doesn't include not clicking on links. Not entering your bank account details afterwards, sure, but clicking on the link is usually pretty safe.
So what's your solution?
I say if you don't want anyone to find out your IP address, use Tor; that's what it was built for. Then if you accidentally click on a link and wind up at an attack site, that site's administrator has no way to figure out your IP.
That only solves the problem of leaking your IP address, of course. The larger problem of being pseudonymous is essentially unsolved. With enough effort just about any pseudonym can be cracked.
I say if you don't want anyone to find out your IP address, use Tor; that's what it was built for. Then if you accidentally click on a link and wind up at an attack site, that site's administrator has no way to figure out your IP.
If you're worried about someone finding out your IP address, then yes, Tor is probably your best bet.
On 20/09/2007, SlimVirgin slimvirgin@gmail.com wrote:
The only workable solution I can see is to make it less likely that stalkers will want to target particular admins.
It's tricky. The problem is that a lot of the people who get blocked are blocked because they're arseholes or nutters. They will take a block for whatever reason as an unacceptable assault against (a) their ego or (b) the REVEALED TRUTH (in capitals). This then gives them an exciting new holy mission in life.
One way to do that would be to set up anonymous admin accounts that multiple admins could use. So for example, if a difficult user needs to be blocked, any admin could access the joint admin account to make the block. The user would only see that User:Admin1 had blocked him. Only trusted people would have access to which admin had made a block with User:Admin1 at time T. I know it would complicate things, and it might make admin abuse a little more likely. And we'd still have the problem of potential leaks, so it wouldn't be foolproof by any means.
Crikey, I'm trying to imagine how paranoid people would get with that in place compared to now. It strikes me as disastrous public relations to remove any accountability or traceability. If you think the paranoids are bad now ...
A solution to an edge case that breaks the normal case is unlikely to gain traction.
- d.
There are plenty of admins that happily make their real identity public knowledge and apparently aren't so afraid of "stalkers" that they're unwilling to block people. There's probably at least one of them online 24 hours a day. Get one of them to make the block.
If there are some gaps in that 24 hour coverage, hire someone to fill in those gaps. Pay them enough that they can buy a PO box, an alarm system for their house, etc. How does society handle having judges and police and presidents and soldiers and other figures who have to make and enforce decisions that rile up a few nutters? Not by making them unaccountable for their actions. If Wikipedia is a serious project creating a real benefit to society, why shouldn't it do the same thing? Being part of the wikipolice is surely less dangerous than being part of the real police.
foundation-l mailing list foundation-l@lists.wikimedia.org http://lists.wikimedia.org/mailman/listinfo/foundation-l
On 20/09/2007, Anthony wikimail@inbox.org wrote:
There are plenty of admins that happily make their real identity public knowledge and apparently aren't so afraid of "stalkers" that they're unwilling to block people. There's probably at least one of them online 24 hours a day. Get one of them to make the block.
I agree, that ought to be enough in most situations. It would be good to have something to fall back on if we end up needing to block someone known to be dangerous, though.
If there are some gaps in that 24 hour coverage, hire someone to fill in those gaps. Pay them enough that they can buy a PO box, an alarm system for their house, etc. How does society handle having judges and police and presidents and soldiers and other figures who have to make and enforce decisions that rile up a few nutters? Not by making them unaccountable for their actions. If Wikipedia is a serious project creating a real benefit to society, why shouldn't it do the same thing? Being part of the wikipolice is surely less dangerous than being part of the real police.
Presidents have bodyguards. Judges generally have police escorts if they need them. Police and soldiers are trained and equipped to defend themselves. Giving Wikipedia admins personal protection would be taking things a little too far, IMHO ;).
On 9/20/07, Thomas Dalton thomas.dalton@gmail.com wrote:
On 20/09/2007, Anthony wikimail@inbox.org wrote:
There are plenty of admins that happily make their real identity public knowledge and apparently aren't so afraid of "stalkers" that they're unwilling to block people. There's probably at least one of them online 24 hours a day. Get one of them to make the block.
I agree, that ought to be enough in most situations. It would be good to have something to fall back on if we end up needing to block someone known to be dangerous, though.
If someone is known to be dangerous, shouldn't we be calling the police? How would having a pool account help matters? The dangerous person would just go after everyone in the pool, or whoever set up the pool, or Jimbo, or the board members (many of whose home addresses are easily found).
Maybe Jimbo would be willing to make the block in those high profile cases. I doubt his doing so would bring him any more attention from stalkers than he already has.
If there are some gaps in that 24 hour coverage, hire someone to fill in those gaps. Pay them enough that they can buy a PO box, an alarm system for their house, etc. How does society handle having judges and police and presidents and soldiers and other figures who have to make and enforce decisions that rile up a few nutters? Not by making them unaccountable for their actions. If Wikipedia is a serious project creating a real benefit to society, why shouldn't it do the same thing? Being part of the wikipolice is surely less dangerous than being part of the real police.
Presidents have bodyguards. Judges generally have police escorts if they need them. Police and soldiers are trained and equipped to defend themselves. Giving Wikipedia admins personal protection would be taking things a little too far, IMHO ;).
For volunteers, yes. But if being an admin is so dangerous that enough people aren't volunteering, hiring one or two people to essentially be paid admins would be a possibility. Creating a world in which every single person can share freely in the sum of human knowledge is a big real world task which has costs and risks involved in it.
Personally I think there are probably enough volunteers right now to cover the task, and hiring someone would be overkill. The solution is as I said a month or so ago: if you're not willing to deal with stalkers, don't be an admin, or at least don't be an admin who performs controversial actions. But if the choice is between taking away admin accountability (as suggested by Sarah) and hiring a few bodyguards, I think the latter is a much better choice.
Am I dreaming, or have I wandered into some alternate universe, or has the whole world gone insane, or what?
You're talking about a bunch of nerds who edit text on some website here, not judges, police, soldiers... please for god's sake have some sense of perspective.
Being an administrator is not "dangerous". There are more than enough people volunteering to help (if you don't think so, stop turning down RfAs for stupid reasons). If someone's too chicken to issue a block, well whoopee-do, that's their problem. Chances are the block wasn't warranted anyway.
Administrators do not need body guards, alarm systems in their house or indeed anything other than the common sense not to post their credit card details online. Nobody has ever been murdered in their sleep because they banned someone on the Star Wars forum they moderate. We're talking about the same kind of thing here.
Having a lovely time back in reality,
-Gurch
Stalkers are not a nice thing to deal with, and I think you're underplaying the seriousness of the issue. There may be no serious risk of physical threat, and thus your comments about bodyguards are appropriate. However, you're not taking into account what it is like when you get death threats by email and snail mail.
About six years ago I had to deal with the divorce of my current partner when she left her husband after he assaulted her. I was made aware through a third party that he had obtained a shotgun and was looking for me. In the end, thanks to my employing good legal representation and hiding my location, he decided to turn the aforementioned weapon on himself.
Don't joke about stalking, it happens.
Brian.
-----Original Message----- From: foundation-l-bounces@lists.wikimedia.org [mailto:foundation-l-bounces@lists.wikimedia.org] On Behalf Of Matthew Britton Sent: 20 September 2007 16:27 To: Wikimedia Foundation Mailing List Subject: Re: [Foundation-l] Release of squid log data
Administrators do not need body guards, alarm systems in their house or indeed anything other than the common sense not to post their credit card details online. Nobody has ever been murdered in their sleep because they banned someone on the Star Wars forum they moderate. We're talking about the same kind of thing here.
On 9/20/07, Brian McNeil brian.mcneil@wikinewsie.org wrote:
Stalkers are not a nice thing to deal with, and I think you're underplaying the seriousness of the issue. There may be no serious risk of physical threat, and thus your comments about bodyguards are appropriate. However, you're not taking into account what it is like when you get death threats by email and snail mail.
I think the point is that nothing we do is going to stop this from happening. Make an ArbCom role account and now everyone on ArbCom will get the death threats instead of just the people supporting the block.
It's not a problem that Wikipedia can solve, so the best way to deal with it is to let admins know what they're getting into and let people who don't want to be admins still contribute in other ways.
Brian McNeil wrote:
Stalkers are not a nice thing to deal with and I think you're underplaying the seriousness of the issue. They may be no serious risk of physical threat and thus your comments about bodyguards are appropriate. However, you're not taking into account what it is like when you get death threats by email and snail-mail.
About 6 years ago I had to deal with the divorce of my current partner when she left her husband after he assaulted her. I was made aware through a third party that he had obtained a shotgun and was looking for me. In the end - thanks to me employing good legal representation and hiding my location - he decided to turn the aforementioned weapon on himself.
Don't joke about stalking, it happens.
Brian.
Yes, stalking happens. It happens in situations such as that which you describe; this is certainly a problem and such incidents should certainly be taken seriously.
But it doesn't happen because someone was banned from a website.
Anyone can make a death threat online. Actually carrying out said threat is never in the mind of whoever makes it. Such threats are the product of the trolls who were blocked in the first place. This thread has turned into exactly the sort of over-the-top response they are trying to get.
Someone, please... tell me I'm not the only one who can see this?
Still enjoying life in the real world,
-Gurch
-----Original Message----- From: foundation-l-bounces@lists.wikimedia.org [mailto:foundation-l-bounces@lists.wikimedia.org] On Behalf Of Matthew Britton Sent: 20 September 2007 16:27 To: Wikimedia Foundation Mailing List Subject: Re: [Foundation-l] Release of squid log data
Am I dreaming, or have I wandered into some alternate universe, or has the whole world gone insane, or what?
You're talking about a bunch of nerds who edit text on some website here, not judges, police, soldiers... please for god's sake have some sense of perspective.
Being an administrator is not "dangerous". There are more than enough people volunteering to help (if you don't think so, stop turning down RfAs for stupid reasons). If someone's too chicken to issue a block, well whoopee-do, that's their problem. Chances are the block wasn't warranted anyway.
Administrators do not need body guards, alarm systems in their house or indeed anything other than the common sense not to post their credit card details online. Nobody has ever been murdered in their sleep because they banned someone on the Star Wars forum they moderate. We're talking about the same kind of thing here.
Having a lovely time back in reality,
-Gurch
--- Anthony wikimail@inbox.org wrote:
On 9/20/07, Thomas Dalton thomas.dalton@gmail.com wrote:
On 20/09/2007, Anthony wikimail@inbox.org wrote:
There are plenty of admins that happily make their real identity public knowledge and apparently aren't so afraid of "stalkers" that they're unwilling to block people. There's probably at least one of them online 24 hours a day. Get one of them to make the block.
I agree, that ought to be enough in most situations. It would be good to have something to fall back on if we end up needing to block someone known to be dangerous, though.
If someone is known to be dangerous, shouldn't we be calling the police? How would having a pool account help matters? The dangerous person would just go after everyone in the pool, or whoever set up the pool, or Jimbo, or the board members (many of whose home addresses are easily found).
Maybe Jimbo would be willing to make the block in those high profile cases. I doubt his doing so would bring him any more attention from stalkers than he already has.
If there are some gaps in that 24 hour coverage, hire someone to fill in those gaps. Pay them enough that they can buy a PO box, an alarm system for their house, etc. How does society handle having judges and police and presidents and soldiers and other figures who have to make and enforce decisions that rile up a few nutters? Not by making them unaccountable for their actions. If Wikipedia is a serious project creating a real benefit to society, why shouldn't it do the same thing? Being part of the wikipolice is surely less dangerous than being part of the real police.

Presidents have bodyguards. Judges generally have police escorts if they need them. Police and soldiers are trained and equipped to defend themselves. Giving Wikipedia admins personal protection would be taking things a little too far, IMHO ;).
For volunteers, yes. But if being an admin is so dangerous that not enough people are volunteering, hiring one or two people to essentially be paid admins would be a possibility. Creating a world in which every single person can share freely in the sum of human knowledge is a big real-world task, and it has costs and risks involved.
Personally I think there are probably enough volunteers right now to cover the task, and hiring someone would be overkill. The solution is as I said a month or so ago: if you're not willing to deal with stalkers, don't be an admin, or at least don't be an admin who performs controversial actions. But if the choice is between taking away admin accountability (as suggested by Sarah) and hiring a few bodyguards, I think the latter is a much better choice.
On 9/20/07, Matthew Britton matthew.britton@btinternet.com wrote:
Brian McNeil wrote:
Stalkers are not a nice thing to deal with and I think you're underplaying the seriousness of the issue. There may be no serious risk of physical threat, and thus your comments about bodyguards are appropriate. However, you're not taking into account what it is like when you get death threats by email and snail-mail.
About 6 years ago I had to deal with the divorce of my current partner when she left her husband after he assaulted her. I was made aware through a third party that he had obtained a shotgun and was looking for me. In the end - thanks to me employing good legal representation and hiding my location - he decided to turn the aforementioned weapon on himself.
Don't joke about stalking, it happens.
Brian.
Yes, stalking happens. It happens in situations such as that which you describe; this is certainly a problem and such incidents should certainly be taken seriously.
But it doesn't happen because someone was banned from a website.
I don't have time to respond in full to this, but I'm afraid it very much does happen because people are blocked or banned, and some of it has been quite serious. Nothing rising to the level of being pursued with a weapon, but very upsetting threats of violence to family members, and attempts to destroy people's careers and reputations.
Sarah
SlimVirgin wrote:
On 9/20/07, Matthew Britton matthew.britton@btinternet.com wrote:
Brian McNeil wrote:
Stalkers are not a nice thing to deal with and I think you're underplaying the seriousness of the issue. There may be no serious risk of physical threat, and thus your comments about bodyguards are appropriate. However, you're not taking into account what it is like when you get death threats by email and snail-mail.
About 6 years ago I had to deal with the divorce of my current partner when she left her husband after he assaulted her. I was made aware through a third party that he had obtained a shotgun and was looking for me. In the end - thanks to me employing good legal representation and hiding my location - he decided to turn the aforementioned weapon on himself.
Don't joke about stalking, it happens.
Brian.
Yes, stalking happens. It happens in situations such as that which you describe; this is certainly a problem and such incidents should certainly be taken seriously.
But it doesn't happen because someone was banned from a website.
I don't have time to respond in full to this, but I'm afraid it very much does happen because people are blocked or banned, and some of it has been quite serious. Nothing rising to the level of being pursued with a weapon, but very upsetting threats of violence to family members, and attempts to destroy people's careers and reputations.
Sarah
There is stalking, and then there is trolling with intent to cause the maximum amount of fuss and emotional distress.
"I'm going to kill your family" is, in terms of emotional distress per word, about as efficient as you can get. It is therefore hardly surprising that it is a common trolling tactic.
One or two unfortunate cases aside (which I cannot help but feel were not entirely unprovoked) it is this latter issue, not stalking per se, which Wikipedia is experiencing.
Furthermore, the issue isn't limited to administrators -- it's perfectly possible to make enemies on-line without having any such extra abilities (though one could argue it helps) -- so discussion of the problem as though it is a phenomenon unique to administrators, as has been happening in this thread, is hardly useful.
-Gurch
On 9/20/07, Matthew Britton matthew.britton@btinternet.com wrote:
SlimVirgin wrote:
On 9/20/07, Matthew Britton matthew.britton@btinternet.com wrote:
Brian McNeil wrote:
Stalkers are not a nice thing to deal with and I think you're underplaying the seriousness of the issue. There may be no serious risk of physical threat, and thus your comments about bodyguards are appropriate. However, you're not taking into account what it is like when you get death threats by email and snail-mail.
About 6 years ago I had to deal with the divorce of my current partner when she left her husband after he assaulted her. I was made aware through a third party that he had obtained a shotgun and was looking for me. In the end - thanks to me employing good legal representation and hiding my location - he decided to turn the aforementioned weapon on himself.
Don't joke about stalking, it happens.
Brian.
Yes, stalking happens. It happens in situations such as that which you describe; this is certainly a problem and such incidents should certainly be taken seriously.
But it doesn't happen because someone was banned from a website.
I don't have time to respond in full to this, but I'm afraid it very much does happen because people are blocked or banned, and some of it has been quite serious. Nothing rising to the level of being pursued with a weapon, but very upsetting threats of violence to family members, and attempts to destroy people's careers and reputations.
Sarah
There is stalking, and then there is trolling with intent to cause the maximum amount of fuss and emotional distress.
"I'm going to kill your family" is, in terms of emotional distress per word, about as efficient as you can get. It is therefore hardly surprising that it is a common trolling tactic.
One or two unfortunate cases aside (which I cannot help but feel were not entirely unprovoked) it is this latter issue, not stalking per se, which Wikipedia is experiencing.
Furthermore, the issue isn't limited to administrators -- it's perfectly possible to make enemies on-line without having any such extra abilities (though one could argue it helps) -- so discussion of the problem as though it is a phenomenon unique to administrators, as has been happening in this thread, is hardly useful.
-Gurch
While I agree that 90-something % of online "death threats" and the like amount to extreme cases of trolling (people with no real-world intent, mindset, or practical opportunity to actually commit violence against the threatened target), there is a small fringe of genuinely credible threats of violence, and some are followed through on.
Enough of them are real credible threats that it's not unreasonable to treat them, categorically, as a legitimate risk.
A focusing phenomenon has been noticed whereby both trolling and legitimate threats are disproportionately directed at people perceived to be in positions of authority - newsgroup moderators, AOL forum moderators, ISP staff, and probably Wikipedia administrators.
That said, I don't agree with going into a bunker mentality about this. If you go into hiding afterwards, the bad guy / troll won.
On 20/09/2007, Matthew Britton matthew.britton@btinternet.com wrote:
Yes, stalking happens. It happens in situations such as that which you describe; this is certainly a problem and such incidents should certainly be taken seriously. But it doesn't happen because someone was banned from a website.
Um, in the case of Wikipedia you're factually incorrect. The hard stalking work of Judd Bagley of overstock.com is a counterexample. Stalking on a corporate budget no less!
- d.
On 9/20/07, David Gerard dgerard@gmail.com wrote:
On 20/09/2007, Matthew Britton matthew.britton@btinternet.com wrote:
Yes, stalking happens. It happens in situations such as that which you describe; this is certainly a problem and such incidents should certainly be taken seriously. But it doesn't happen because someone was banned from a website.
Um, in the case of Wikipedia you're factually incorrect. The hard stalking work of Judd Bagley of overstock.com is a counterexample. Stalking on a corporate budget no less!
So call the cops. You know who he is. You know what jurisdiction he lives in. Stalking is illegal. So if he really is stalking you, call the cops.
Please note that I have no evidence that Judd actually has stalked anybody.
On 9/20/07, David Gerard dgerard@gmail.com wrote:
Um, in the case of Wikipedia you're factually incorrect. The hard stalking work of Judd Bagley of overstock.com is a counterexample. Stalking on a corporate budget no less!
When did investigation and fact based criticism become synonymous with stalking? I missed that memo.
Or is it only stalking when it's someone "we" dislike investigating someone "we" like, and protected free speech the other way around? (like the extensive research that some of the "anti-stalkers" put into Daniel Brandt these last few years)
I'm not saying that I agree with the allegations, but to call it stalking when someone investigates something which they reasonably believe to be misconduct just seems wrong to me.
It's possible for perfectly reasonable people to believe completely stupid things. We see it all the time. We should respond to their concerns with respect, and with dispassionate facts. If they are reasonable the disagreement will be easily resolved, and if they are unreasonable their continued aggression towards a reasonable and respectful response will discredit them in ways that no amount of censorship could ever hope to achieve.
On 20/09/2007, Gregory Maxwell gmaxwell@gmail.com wrote:
When did investigation and fact based criticism become synonymous with stalking? I missed that memo.
Or is it only stalking when it's someone "we" dislike investigating someone "we" like, and protected free speech the other way around? (like the extensive research that some of the "anti-stalkers" put into Daniel Brandt these last few years)
I'm not saying that I agree with the allegations, but to call it stalking when someone investigates something which they reasonably believe to be misconduct just seems wrong to me.
His activities in this regard (though not in relation to Wikipedia editors) have made serious papers. I fear you're talking out your arse on this one.
- d.
On 9/20/07, David Gerard dgerard@gmail.com wrote:
On 20/09/2007, Gregory Maxwell gmaxwell@gmail.com wrote:
When did investigation and fact based criticism become synonymous with stalking? I missed that memo.
Or is it only stalking when it's someone "we" dislike investigating someone "we" like, and protected free speech the other way around? (like the extensive research that some of the "anti-stalkers" put into Daniel Brandt these last few years)
I'm not saying that I agree with the allegations, but to call it stalking when someone investigates something which they reasonably believe to be misconduct just seems wrong to me.
His activities in this regard (though not in relation to Wikipedia editors) have made serious papers. I fear you're talking out your arse on this one.
The closest I can find is an accusation by [good guy number 1], who ironically uses the same exact law (Section 113 of the "Violence Against Women and Department of Justice Reauthorization Act") that [bad guy number 1] accuses Wikipedia admins of violating. And [good guy number 1]'s accusation in itself rests upon "outing" [bad guy number 2].
If that law really is to be read as broadly as [bad guy number 1] and [good guy number 1] want it read (that "annoying" people anonymously is illegal), then [good guy number 1], [good gal number 1], and [bad guy number 2] are all guilty of violating it.
On 9/20/07, Matthew Britton matthew.britton@btinternet.com wrote:
...This thread has turned into exactly the sort of over-the-top response they are trying to get.
Someone, please... tell me I'm not the only one who can see this?
Still enjoying life in the real world,
-Gurch
As a neutral observer, it looks like a bunker mentality to dispose of WP:Dispute Resolution policy, abandon any pretense to so-called "Accountability", and purge dissenting voices as terrorists making death threats.
Rob Smith
Rob Smith wrote:
On 9/20/07, Matthew Britton matthew.britton@btinternet.com wrote:
...This thread has turned into exactly the sort of over-the-top response they are trying to get.
Someone, please... tell me I'm not the only one who can see this?
Still enjoying life in the real world,
-Gurch
As a neutral observer, it looks like a bunker mentality to dispose of WP:Dispute Resolution policy, abandon any pretense to so-called "Accountability", and purge dissenting voices as terrorists making death threats.
Rob Smith
What? Next you'll be telling me they want to expand the external links policy so they can ban anyone who links to a site they don't like.
Oh wait...
-Gurch
Matthew Britton wrote:
Yes, stalking happens. It happens in situations such as that which you describe; this is certainly a problem and such incidents should certainly be taken seriously.
But it doesn't happen because someone was banned from a website.
Anyone can make a death threat online. Actually carrying out such a threat is almost never in the mind of whoever makes it. Such threats are the product of the trolls who were blocked in the first place. This thread has turned into exactly the sort of over-the-top response they are trying to get.
Someone, please... tell me I'm not the only one who can see this?
You're not the only one Gurch. I think people are taking the idle threats of idiot teenagers way too seriously.
Brian McNeil wrote:
Stalkers are not a nice thing to deal with and I think you're underplaying the seriousness of the issue. There may be no serious risk of physical threat, and thus your comments about bodyguards are appropriate. However, you're not taking into account what it is like when you get death threats by email and snail-mail.
I don't think it's fair to imply that those of us who oppose specific "anti-stalking" measures are somehow unfamiliar with the issue. Many of us have experienced crazy people going on crusades and sending out flurries of crazy and often threatening messages and/or phone calls. I'm just more jaded about it, and consider it part of the cost of doing business, so to speak, when dealing with electronic communications systems. That sort of thing has been a part of the internet for as long as I can remember, and the internet hasn't fallen apart because of it, so people muddle through as always. Nobody came up with a way to solve the problem on Usenet in the mid-1990s without making other things worse, and I don't think you're going to come up with a magic solution today on Wikipedia without breaking something else either.
-Mark
Anthony wrote:
There are plenty of admins that happily make their real identity public knowledge and apparently aren't so afraid of "stalkers" that they're unwilling to block people. There's probably at least one of them online 24 hours a day. Get one of them to make the block.
If there are some gaps in that 24 hour coverage, hire someone to fill in those gaps. Pay them enough that they can buy a PO box, an alarm system for their house, etc.
Until I got to this bit I was going to say, "Oh please, if it's that much of a problem, just give me my sysop bit back and I'll block anyone you tell me to." It's only clicking a button on a website, after all.
But if you're offering *money* for my services, perhaps I should hold out for a better deal. I could use a new alarm system...
-Gurch
On 9/20/07, Matthew Britton matthew.britton@btinternet.com wrote:
Anthony wrote:
There are plenty of admins that happily make their real identity public knowledge and apparently aren't so afraid of "stalkers" that they're unwilling to block people. There's probably at least one of them online 24 hours a day. Get one of them to make the block.
If there are some gaps in that 24 hour coverage, hire someone to fill in those gaps. Pay them enough that they can buy a PO box, an alarm system for their house, etc.
Until I got to this bit I was going to say, "Oh please, if it's that much of a problem, just give me my sysop bit back and I'll block anyone you tell me to." It's only clicking a button on a website, after all.
But if you're offering *money* for my services, perhaps I should hold out for a better deal. I could use a new alarm system...
Holding out will only work if you can convince all the other admins to hold out too. Maybe y'all should unionize...
Anthony wrote:
On 9/20/07, Matthew Britton matthew.britton@btinternet.com wrote:
Anthony wrote:
There are plenty of admins that happily make their real identity public knowledge and apparently aren't so afraid of "stalkers" that they're unwilling to block people. There's probably at least one of them online 24 hours a day. Get one of them to make the block.
If there are some gaps in that 24 hour coverage, hire someone to fill in those gaps. Pay them enough that they can buy a PO box, an alarm system for their house, etc.
Until I got to this bit I was going to say, "Oh please, if it's that much of a problem, just give me my sysop bit back and I'll block anyone you tell me to." It's only clicking a button on a website, after all.
But if you're offering *money* for my services, perhaps I should hold out for a better deal. I could use a new alarm system...
Holding out will only work if you can convince all the other admins to hold out too. Maybe y'all should unionize...
We demand a 10% raise!
Wait, that's still nothing...
-Gurch
On 9/19/07, SlimVirgin slimvirgin@gmail.com wrote:
Yes, I agree that protecting IP address is hard.
Not for admins. Just use Tor.
Anthony wrote:
On 9/19/07, SlimVirgin slimvirgin@gmail.com wrote:
Yes, I agree that protecting IP address is hard.
Not for admins. Just use Tor.
It's very easy to say "just use Tor". But have you actually done so? I bet I have more Tor experience than 99% of the people on this list -- I semi-regularly use it for web browsing and I've even written up some GNU/Linux applications designed to interface through Tor on the command line. And my simple conclusion is this: Tor is slow. Really really slow. It turns a 100ms page load into a page load that takes many seconds, *if* it doesn't time out. Using Tor makes the web browsing experience significantly worse, and only makes sense to use when security is really in question. Wikipedia should not be a site whose security is so risky that we have to recommend our admins go through the agony of trying to do all of their Wikipedia work through Tor.
And by the way, remember that all unencrypted web traffic ends up unencrypted at the Tor exit node, and can be (and sometimes is) sniffed by unscrupulous folks. If you are using Tor you *must* make sure to use only the secure Wikimedia https proxy. Even that is difficult though, because you'll end up clicking a link that takes you to insecure http pages (such as diff links), and before you can blink, your admin cookie has gone across the web unencrypted. As far as I can see there is no fool-proof way of using Tor with Wikipedia, except for maybe blocking unencrypted http Wikipedia at a firewall level.
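One way round the insecure-diff-link trap described above would be to rewrite plain-http Wikipedia URLs to the secure gateway before following them. A minimal sketch, assuming the secure.wikimedia.org path mapping in use at the time (treat the exact scheme as an assumption):

```python
from urllib.parse import urlparse

def to_secure(url):
    """Rewrite an http en.wikipedia.org URL to the secure gateway.
    Assumed mapping: https://secure.wikimedia.org/wikipedia/en/<path>."""
    p = urlparse(url)
    if p.scheme == "http" and p.netloc == "en.wikipedia.org":
        return "https://secure.wikimedia.org/wikipedia/en" + p.path + (
            "?" + p.query if p.query else "")
    return url  # already secure, or not a Wikipedia link: leave untouched

print(to_secure("http://en.wikipedia.org/w/index.php?diff=12345"))
```

A browser extension or local rewriting proxy doing this would keep the login cookie from ever travelling over plain http, without needing firewall rules.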
On 9/20/07, Ben McIlwain cydeweys@gmail.com wrote:
Anthony wrote:
On 9/19/07, SlimVirgin slimvirgin@gmail.com wrote:
Yes, I agree that protecting IP address is hard.
Not for admins. Just use Tor.
It's very easy to say "just use Tor". But have you actually done so?
Umm, yeah.
I bet I have more Tor experience than 99% of the people on this list -- I semi-regularly use it for web browsing and I've even written up some GNU/Linux applications designed to interface through Tor on the command line.
I edit Wikipedia through Tor all the time. I even set up a script which compares the list of tor exit nodes against the list of blocked Wikipedia IPs and tells Tor to use only exit nodes which allow editing, thus avoiding blocking.
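The comparison described here can be sketched in a few lines. The IPs below are invented sample data; in practice both lists would be fetched from live sources (the exit-node list from the Tor directory, the blocked IPs from the wiki):

```python
# Sketch of filtering Tor exit nodes against a wiki block list.
# Inputs are plain IP strings; the data here is example-only.

def usable_exits(exit_ips, blocked_ips):
    """Return exit-node IPs that are not on the block list."""
    return sorted(set(exit_ips) - set(blocked_ips))

def torrc_fragment(ips):
    """torrc lines pinning Tor to the allowed exits only."""
    return "ExitNodes " + ",".join(ips) + "\nStrictNodes 1\n"

exits = ["192.0.2.1", "192.0.2.2", "198.51.100.7"]  # sample exit nodes
blocked = ["192.0.2.2"]                             # sample block list
print(torrc_fragment(usable_exits(exits, blocked)))
```

ExitNodes and StrictNodes are real torrc options; the fragment would need regenerating as exit nodes and blocks change, which is presumably why a script is needed at all.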
And my simple conclusion is this: Tor is slow. Really really slow. It turns a 100ms page load into a page load that takes many seconds, *if* it doesn't time out.
Do you have the latest version? I'm getting fairly consistent page loads of less than a second right now. Maybe it's because of the exit node thing. But it seems to me like you must not have the latest version.
Using Tor makes the web browsing experience significantly worse, and only makes sense to use when security is really in question.
Well, obviously "security" is a big issue for Sarah.
Wikipedia should not be a site whose security is so risky that we have to recommend our admins go through the agony of trying to do all of their Wikipedia work through Tor.
I wouldn't recommend it to everyone, only to paranoid people like me and Sarah.
And by the way, remember that all unencrypted web traffic ends up unencrypted at the Tor exit node, and can be (and sometimes is) sniffed by unscrupulous folks. If you are using Tor you *must* make sure to use only the secure Wikimedia https proxy.
Of course. This is a good idea for admins to always do anyway.
Even that is difficult though, because you'll end up clicking a link that takes you to insecure http pages (such as diff links), and before you can blink, your admin cookie has gone across the web unencrypted. As far as I can see there is no fool-proof way of using Tor with Wikipedia, except for maybe blocking unencrypted http Wikipedia at a firewall level.
Umm, now I'm going to have to ask you: have you ever actually used Tor? Cookies don't get sent to the insecure pages, and the diff links aren't insecure.
Anthony wrote:
And my simple conclusion is this: Tor is slow. Really really slow. It turns a 100ms page load into a page load that takes many seconds, *if* it doesn't time out.
Do you have the latest version? I'm getting fairly consistent page loads of less than a second right now. Maybe it's because of the exit node thing. But it seems to me like you must not have the latest version.
Yeah, I have the latest version. The speed issues are widely experienced by many, many people. It's not just me. You seem to be lucky.
Even that is difficult though, because you'll end up clicking a link that takes you to insecure http pages (such as diff links), and before you can blink, your admin cookie has gone across the web unencrypted. As far as I can see there is no fool-proof way of using Tor with Wikipedia, except for maybe blocking unencrypted http Wikipedia at a firewall level.
Cookies don't get sent to the insecure pages, and the diff links aren't insecure.
Diff links are insecure. When someone puts a diff link onto a page, the secure proxy does not edit that link to turn it into a secure link. As for the cookies issue, I guess I was confusing myself because I am logged onto en-wiki as well as the secure proxy.
On 2007.09.20 20:15:12 -0400, Ben McIlwain cydeweys@gmail.com scribbled 39 lines: ...
And by the way, remember that all unencrypted web traffic ends up unencrypted at the Tor exit node, and can be (and sometimes is) sniffed by unscrupulous folks. If you are using Tor you *must* make sure to use only the secure Wikimedia https proxy. Even that is difficult though, because you'll end up clicking a link that takes you to insecure http pages (such as diff links), and before you can blink, your admin cookie has gone across the web unencrypted.
...
Is this actually true, though? As I've said before, I edit through secure.wikimedia.org, and I've done so for the past few months. In that time, I've clicked on external links to en.wikipedia.org/wiki/whatever - not internal links to https://secure.wikimedia.org/wikipedia/en/wiki/whatever - and not once have I found myself to be logged in on En.
On 9/20/07, Gwern Branwen gwern0@gmail.com wrote:
On 2007.09.20 20:15:12 -0400, Ben McIlwain cydeweys@gmail.com scribbled 39 lines: ...
And by the way, remember that all unencrypted web traffic ends up unencrypted at the Tor exit node, and can be (and sometimes is) sniffed by unscrupulous folks. If you are using Tor you *must* make sure to use only the secure Wikimedia https proxy. Even that is difficult though, because you'll end up clicking a link that takes you to insecure http pages (such as diff links), and before you can blink, your admin cookie has gone across the web unencrypted.
...
Is this actually true, though? As I've said before, I edit through secure.wikimedia.org, and I've done so for the past few months. In that time, I've clicked on external links to en.wikipedia.org/wiki/whatever - not internal links to https://secure.wikimedia.org/wikipedia/en/wiki/whatever - and not once have I found myself to be logged in on En.
No, it's absolutely untrue. I just verified it. The cookies are properly sent as "secure" cookies, "secure" being a flag which, when set, means not only will the cookies not be sent to en.wikipedia.org, they won't even be sent to http://secure.wikimedia.org/.
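The behaviour verified above comes from the cookie's Secure attribute. A quick illustration of the flag being parsed out of a Set-Cookie header (the cookie name and value below are invented for the example):

```python
from http.cookies import SimpleCookie

# Parse a Set-Cookie header of the kind a secure gateway might emit.
# The cookie name and value are made up for illustration.
header = "enwikiSession=abc123; Path=/; Secure; HttpOnly"
jar = SimpleCookie()
jar.load(header)
morsel = jar["enwikiSession"]

# With Secure set, a conforming client sends the cookie over HTTPS only,
# so it never crosses a plain-http link for an exit-node sniffer to grab.
print(bool(morsel["secure"]))  # True
```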
The only workable solution I can see is to make it less likely that stalkers will want to target particular admins. One way to do that would be to set up anonymous admin accounts that multiple admins could use. So for example, if a difficult user needs to be blocked, any admin could access the joint admin account to make the block. The user would only see that User:Admin1 had blocked him. Only trusted people would have access to which admin had made a block with User:Admin1 at time T.
I know it would complicate things, and it might make admin abuse a little more likely. And we'd still have the problem of potential leaks, so it wouldn't be foolproof by any means.
I would go with an ArbCom role account that can enforce decisions made by ArbCom on their private mailing list. That way there are not any individuals with the power to make untraceable blocks. ArbCom are the most trusted members of the community (they may not have universal support, but they are still trusted with checkuser and oversight), so it makes sense that they be the ones to do this.
On 19/09/2007, Gregory Maxwell gmaxwell@gmail.com wrote:
... it's very very very easy to accidentally edit while logged out, especially when you cross over to one of our other wikis like commons or meta.
Mmm. I have, in the past, edited (unintentionally) logged out from three different home addresses, from work, and from a friend's machine *whose IP resolved to her name*.
(Seriously. Oh, you have to love badly-designed university networks.)
Providing privacy strong enough to stop a stalker for people who are indirectly spewing out large amounts of information about themselves in the form of edits is just a really hard problem which I don't have a solution for...
Mmm. Completely concealing all information which could be an indirect pointer would lead you to spend all your time randomly copyediting articles on things that don't interest you, interacting with no-one, *and nothing else*, which whilst useful assumes a rather high level of dedication to the project :-)
On 9/20/07, Andrew Gray shimgray@gmail.com wrote:
On 19/09/2007, Gregory Maxwell gmaxwell@gmail.com wrote:
... it's very very very easy to accidentally edit while logged out, especially when you cross over to one of our other wikis like commons or meta.
Mmm. I have, in the past, edited (unintentionally) logged out from three different home addresses, from work, and from a friend's machine *whose IP resolved to her name*.
That's another advantage of using Tor. If you accidentally try to edit while logged out, you usually wind up with an "IP is blocked" message!
On 20/09/2007, Anthony wikimail@inbox.org wrote:
That's another advantage of using Tor. If you accidentally try to edit while logged out, you usually wind up with an "IP is blocked" message!
Heh. Maybe we could get around the whole blocking-open-proxies argument by redefining it as "a feature to help our security-conscious users" ;-)
On 9/20/07, Andrew Gray andrew.gray@dunelm.org.uk wrote:
On 20/09/2007, Anthony wikimail@inbox.org wrote:
That's another advantage of using Tor. If you accidentally try to edit while logged out, you usually wind up with an "IP is blocked" message!
Heh. Maybe we could get around the whole blocking-open-proxies argument by redefining it as "a feature to help our security-conscious users" ;-)
Sounds good, just make sure you softblock them.
Actually, now that checkuser is in place maybe it's time to just stop displaying IP addresses of "anon editors" regardless of whether or not they use Tor.
On 20/09/2007, Andrew Gray shimgray@gmail.com wrote:
On 19/09/2007, Gregory Maxwell gmaxwell@gmail.com wrote:
... it's very very very easy to accidentally edit while logged out, especially when you cross over to one of our other wikis like commons or meta.
Mmm. I have, in the past, edited (unintentionally) logged out from three different home addresses, from work, and from a friend's machine *whose IP resolved to her name*.
(Seriously. Oh, you have to love badly-designed university networks.)
If I accidentally edit logged out, you'll find the enwiki user page for my static IP address links to my user page. I understand that other people want to keep their identity secret, but I really don't care. Perhaps I would feel differently if someone starting stalking me, but even then I doubt it. To be honest, not trying to hide probably reduces your chances of being stalked - what's the point in stalking someone who tells you exactly where to find them? (Well, I don't intend to hand out my home address or phone number, but it wouldn't be too difficult for someone to find out my uni timetable, find my picture on the noticeboard in my college and wait for me outside Algebraic Geometry. I would probably notice if they followed me home, though.)