Dear Wikimedia community,
please have a look at the following collaboration proposal, which I sent to the Wikimedia board yesterday. The board (and I as well!) would like to hear your opinion on it before making a decision. I would like to stress that this collaboration would have purely scientific and academic purposes, without any commercial involvement. Finally, the analysis software we are developing (and will apply to Wikipedia data, if the proposal is accepted) will be distributed to the scientific community as open source.
Of course, I will be glad to provide any details and explanations you think necessary.
Thank you for your attention. Best regards,
- Mirco
----------------------------------------------------------
Dear Sirs,
I am writing to you on behalf of the KDD-Lab (Laboratory on Knowledge Discovery and Delivery: http://www-kdd.isti.cnr.it), a branch of the ISTI institute of the Italian National Research Centre (CNR).
Our group is working (among other things) on a project on the analysis of web server logs, and we have recently been working on analysis techniques that seem best suited to "content-rich sites". Our first thought obviously went to Wikipedia...
We would like the opportunity to apply our analysis techniques to the web logs of Wikipedia. Looking at the Wikipedia access statistics, we believe that an optimal amount of data would be one of the following: (1) the raw web logs of the English section covering a few days of usage, or (2) a few weeks for the Italian section.
Do you think it would be possible to start this kind of collaboration?
Of course, we are willing to provide you with all the legal agreements you consider necessary, especially those regarding privacy. And, obviously, we will properly acknowledge your contribution in any of our scientific publications and reports where we use it.
[Addendum: the sensitive information in web logs is essentially located in the "client IP" field ("who visited that page"). However, for our research purposes this field is not strictly needed, as an encrypted version of it would be enough, thus avoiding most of the privacy issues.]
Thank you for your attention. I look forward to receiving your answer and opinion, and send you my best regards,
- Mirco Nanni
==================================== http://ercolino.isti.cnr.it/mirco ====================================
On 5/10/05, Mirco Nanni <mirco.nanni@isti.cnr.it> wrote:
Of course, we are willing to provide you with all the legal agreements you consider necessary, especially those regarding privacy. And, obviously, we will properly acknowledge your contribution in any of our scientific publications and reports where we use it.
[Addendum: the sensitive information in web logs is essentially located in the "client IP" field ("who visited that page"). However, for our research purposes this field is not strictly needed, as an encrypted version of it would be enough, thus avoiding most of the privacy issues.]
The problem is that if you replace the IP with a unique number and still show accesses to user pages, you can probably identify the logged-in users. I'd be OK if the IPs were masked AND accesses to non-article namespace pages were not given out.
Dori wrote:
The problem is that if you replace the IP with a unique number and still show accesses to user pages, you can probably identify the logged-in users. I'd be OK if the IPs were masked AND accesses to non-article namespace pages were not given out.
Well, our objective is not to make web accesses public, but to apply analysis techniques to them and possibly publish some selected results (something like the Webalizer system that is now used to build the Wikipedia usage statistics, but a bit more sophisticated and specific). However, you are right: masking IPs does not solve privacy problems once and for all. I agree with restricting the data to web traffic for articles and discarding user pages and the like; in any case, those are not very interesting for our research purposes.
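To make the two safeguards discussed in this thread concrete (masking the client IP, and handing out only article accesses), here is a minimal sketch in Python. It is only an illustration under stated assumptions, not an agreed procedure: the log layout is assumed to be the Common Log Format, the "encrypted version" of the IP is replaced by a keyed hash (HMAC-SHA256), and the namespace list and the function names (pseudonymize_ip, is_article_path, sanitize_log) are hypothetical.

```python
import hashlib
import hmac
import re
from urllib.parse import unquote

# ASSUMPTION: an illustrative, incomplete list of non-article namespace prefixes.
NON_ARTICLE_NAMESPACES = {
    "talk", "user", "user_talk", "wikipedia", "wikipedia_talk",
    "image", "image_talk", "mediawiki", "mediawiki_talk",
    "template", "template_talk", "help", "help_talk",
    "category", "category_talk", "special",
}

# ASSUMPTION: Common Log Format, e.g.
#   192.0.2.1 - - [10/May/2005:12:00:00 +0000] "GET /wiki/Pisa HTTP/1.1" 200 1234
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) (?P<rest>\S+ \S+ \[[^\]]*\] "\S+ (?P<path>\S+)[^"]*".*)$'
)


def pseudonymize_ip(ip: str, key: bytes) -> str:
    """Replace an IP with a keyed hash; the same IP and key give the same token."""
    return hmac.new(key, ip.encode("utf-8"), hashlib.sha256).hexdigest()[:16]


def is_article_path(path: str) -> bool:
    """True only for /wiki/<title> requests whose title has no known namespace prefix."""
    if not path.startswith("/wiki/"):
        return False  # conservative: drop index.php and any other entry point
    title = unquote(path[len("/wiki/"):].split("?", 1)[0]).replace(" ", "_")
    prefix, sep, _ = title.partition(":")
    # A colon alone is not enough: article titles may legitimately contain one.
    return not (sep and prefix.lower() in NON_ARTICLE_NAMESPACES)


def sanitize_log(lines, key: bytes):
    """Yield article-only log lines with the client IP replaced by a pseudonym."""
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None or not is_article_path(match.group("path")):
            continue  # skip malformed lines and non-article requests
        yield f'{pseudonymize_ip(match.group("ip"), key)} {match.group("rest")}'


if __name__ == "__main__":
    sample = [
        '192.0.2.1 - - [10/May/2005:12:00:00 +0000] "GET /wiki/Pisa HTTP/1.1" 200 1234',
        '192.0.2.1 - - [10/May/2005:12:00:05 +0000] "GET /wiki/User:Dori HTTP/1.1" 200 99',
    ]
    for cleaned in sanitize_log(sample, key=b"secret-kept-by-the-log-owner"):
        print(cleaned)
```

Because the same IP always maps to the same token, browsing sessions can still be followed, which is exactly why the pseudonym alone does not settle the privacy question and the namespace filter is applied as well. The key would have to stay with whoever owns the logs; without it, the small IPv4 address space makes a plain unkeyed hash easy to reverse by enumeration.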
- Mirco