[Foundation-l] Wikipedia Web Logs for scientific research

Mirco Nanni mirco.nanni at isti.cnr.it
Tue May 10 09:37:32 UTC 2005


Dear Wikimedia community,

   please have a look at the following collaboration 
proposal, which I sent to the Wikimedia board just yesterday. 
The board (and I, too!) would like to hear your opinion 
on it before making a decision.
   I would like to make clear that this collaboration would 
serve scientific and academic purposes only, without any 
commercial involvement.
   Finally, the analysis software we are developing (and 
will apply to Wikipedia data, if the proposal is 
accepted) will be distributed to the scientific community as 
open source.

   Of course, I will be glad to provide any details and 
explanations you think necessary.

   Thank you for your attention. Best regards,

	- Mirco


----------------------------------------------------------

Dear Sirs,

   I am writing to you on behalf of the KDD-Lab (Laboratory 
on Knowledge Discovery and Delivery: 
http://www-kdd.isti.cnr.it), a branch of the ISTI institute 
of the Italian National Research Centre (CNR).

   Our group is working (among other things) on a project 
on the analysis of web server logs, and recently we have 
been developing analysis techniques that seem to be best 
suited to "content-rich sites". Our first thought obviously 
went to Wikipedia...

   We would like to have the opportunity to apply our 
analysis techniques to the web logs of Wikipedia. Looking at 
the Wikipedia access statistics, we believe that an optimal 
amount of data would be one of the following: (1) the raw 
web logs of the English section covering a few days of 
usage, or (2) those of the Italian section covering a few 
weeks.

   Do you think it would be possible to start this kind of 
collaboration?

   Of course, we are willing to provide all the legal 
agreements you consider necessary, especially those 
regarding privacy. And, obviously, we will properly 
acknowledge your contribution in any of our scientific 
publications and reports in which we use it.

[Addendum: the sensitive information in web logs is 
essentially located in the "client IP" field ("who visited 
that page"). However, for our research purposes this field 
is not strictly needed, as an encrypted version of it would 
be enough, thus avoiding most of the privacy issues.]
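
As a rough illustration of the addendum, the sketch below 
replaces the client IP with a keyed hash (HMAC-SHA256) 
rather than true encryption, so that visits from the same 
address still group together while the address itself is 
not exposed. Everything here is an assumption made for the 
example: the function names, the 16-character truncation, 
and the idea that the client IP is the first 
space-separated field of each log line (as in the Common 
Log Format). The actual layout of the Wikipedia logs is not 
described in this message.

```python
import hashlib
import hmac


def pseudonymize_ip(ip: str, key: bytes) -> str:
    """Return a stable keyed pseudonym for a client IP (hypothetical helper)."""
    digest = hmac.new(key, ip.encode("utf-8"), hashlib.sha256).hexdigest()
    # Truncated only to keep log lines short; this length is an arbitrary choice.
    return digest[:16]


def pseudonymize_line(line: str, key: bytes) -> str:
    """Replace the first space-separated field, assumed to be the client IP."""
    ip, sep, rest = line.partition(" ")
    if not sep:
        # No field separator found: leave the line unchanged.
        return line
    return pseudonymize_ip(ip, key) + sep + rest


if __name__ == "__main__":
    secret = b"kept-by-the-log-owner"  # placeholder key, never shared with analysts
    sample = '192.0.2.1 - - [10/May/2005:09:37:32 +0000] "GET /wiki/Main_Page HTTP/1.0" 200 1234'
    print(pseudonymize_line(sample, secret))
```

With this approach the key would stay with whoever owns the 
logs, so the recipients of the data could not rebuild the 
mapping by hashing candidate addresses themselves.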

   Thank you for your attention.
   Looking forward to receiving your answer and opinion, I 
send you my best regards,

                      - Mirco Nanni

  ====================================
   http://ercolino.isti.cnr.it/mirco
  ====================================



