For a while now, we've been releasing Squid log data, stripped of
personally identifying information such as IP addresses, to groups at
two universities: Vrije Universiteit and the University of Minnesota. We
now have a request pending from a third group, at Universidad Rey Juan
Carlos in Spain. They are asking if they can have the full data stream
including IP addresses, and they are prepared to sign a confidentiality
agreement to get it.

I'm leaning towards letting them have it. Via the confidentiality
agreement, we can avoid the most likely abuse scenarios, such as release
of individual user profiles. Currently we let toolserver users process
similar data, assisted by Wikipedia administrators who put web bugs on
the site. They use it to produce the WikiCharts report. Are we to tell
prospective research groups to use the toolserver, rather than their own
substantial hardware, for analysis of Wikipedia traffic patterns?

I'm not sure if this would be allowed under the privacy policy, which
does mention statistics but doesn't say who is making them. Maybe the
use of web bugs by administrators is already against the privacy
policy. In any
case, I think the question would benefit from community discussion,
which is why I am posting it here.
-- Tim Starling