Hello,
Apologies for this being a repeat; I was just informed that the original ended up being read as part of an existing thread.
I'd like to request access to anonymized Apache logs of Wikipedia user data.
I am creating a browsing interface for Wikipedia that requires clustered user data (in that sense it is akin to finding articles using the Amazon recommendation system or the earlier MovieLens recommendation system).
For this I need access to user page requests over time, preferably stored in a database. I can provide a script that translates users' IP addresses to unique signatures so that the users themselves remain anonymous, loads the data into a reasonably size-efficient MySQL table, etc.
I was told that I might need to talk to Kate about the feasibility of doing this. Are there any existing objections to retaining anonymized Apache log data for research purposes?
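As a rough illustration of the kind of script described above (not the actual script), the following sketch parses a log line and replaces the IP address with a keyed SHA-1 signature. The secret key, the assumption that the logs are in Apache common log format, and the function names are all hypothetical.

```python
import hashlib
import hmac
import re

# Hypothetical secret key (an assumption, not from the thread). A plain
# unkeyed hash of an IPv4 address is easy to reverse by trying every address.
SECRET_KEY = b"replace-with-a-random-secret"

# Assumed Apache common log format:
# host ident user [time] "METHOD path PROTO" status size
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>[A-Z]+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) \S+'
)


def signature(ip: str) -> str:
    """Map an IP address to a stable, keyed SHA-1 signature."""
    return hmac.new(SECRET_KEY, ip.encode("ascii"), hashlib.sha1).hexdigest()


def anonymize(line: str):
    """Return (signature, time, method, path), or None if the line does not parse."""
    m = LOG_LINE.match(line)
    if m is None:
        return None
    return signature(m["ip"]), m["time"], m["method"], m["path"]
```

The same IP always maps to the same signature, so requests can still be grouped per user without the address itself being stored.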
Tony Pryor
Using publicly-available data you can find out the set of pages edited by each username. Then it is possible, with some degree of uncertainty, to link some usernames to one or more "unique signature"s (from your quoted text above), by matching sets of user page requests to sets of pages edited. Thus some of the data we would release to you is bordering on, if not definitely, personally identifiable data which is not already publicly available. The privacy policy [1] says that "personally identifiable data collected in the server logs will not be released by the developers who have access to it," except under certain circumstances, none of which cover this case.
[1] http://wikimediafoundation.org/wiki/Privacy_policy
en:user:jeronim
I would like to work around this issue so that no editor's identity is compromised. If the IP addresses of named Wikipedia editors are discernible by matching their dated edit saves on Wikipedia to the corresponding Apache log entries, then the script can skip such log entries entirely when it finds them, rather than SHA1-hashing them. That way no personally identifiable data will exist.
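A minimal sketch of that filtering step, under stated assumptions: edit saves are taken to appear in the log as POST requests whose path contains the page title, timestamps are taken to match the public page history exactly, and every entry from a matched IP is dropped (one possible reading of "skipped entirely"). The `Entry` type and function names are hypothetical.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    ip: str
    time: str    # assumed to be directly comparable to public edit timestamps
    method: str
    path: str


def editor_ips(entries, edit_saves):
    """Collect IPs whose requests coincide with a recorded edit save.

    edit_saves is a set of (time, page_title) pairs from public page histories.
    The POST-plus-title match is an assumption about the URL layout.
    """
    ips = set()
    for e in entries:
        if e.method != "POST":
            continue
        if any(t == e.time and title in e.path for t, title in edit_saves):
            ips.add(e.ip)
    return ips


def drop_editors(entries, edit_saves):
    """Return only the entries from IPs that never matched an edit save."""
    entries = list(entries)
    blocked = editor_ips(entries, edit_saves)
    return [e for e in entries if e.ip not in blocked]
```

This only removes requests from IPs that can be tied to an edit save in the same log; it would run before any hashing, since it needs the raw addresses.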
Tony Pryor
_______________________________________________
Wikitech-l mailing list
Wikitech-l@wikimedia.org
http://mail.wikipedia.org/mailman/listinfo/wikitech-l
Even records of just page views (no edits) with no time stamps and a hashed IP will give you information about personal interests which can be fuzzily matched to named editors, by virtue of the fact that the set of pages viewed is likely a superset of the set of pages edited. People may have interests in certain topics and read Wikipedia pages related to those topics, but avoid editing these pages in order to keep those interests private. This personally identifiable information, or at least an approximation of it, would be leaked, and the privacy policy would be violated, no?
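The fuzzy matching described above could look roughly like the following sketch, which scores how much of each username's edited-page set is covered by each hashed signature's viewed-page set. The function names, the data layout, and the 0.9 threshold are illustrative assumptions only.

```python
def containment(edited: set, viewed: set) -> float:
    """Fraction of a username's edited pages that also appear in a signature's views."""
    return len(edited & viewed) / len(edited) if edited else 0.0


def best_matches(edits_by_user, views_by_signature, threshold=0.9):
    """Pair each username with the signatures whose views cover most of its edits.

    edits_by_user: username -> set of page titles (publicly available).
    views_by_signature: hashed signature -> set of page titles (from the logs).
    """
    matches = {}
    for user, edited in edits_by_user.items():
        hits = [
            sig
            for sig, viewed in views_by_signature.items()
            if containment(edited, viewed) >= threshold
        ]
        if hits:
            matches[user] = hits
    return matches
```

A username with only a few edits will match many signatures, which is where the uncertainty comes from; a prolific editor's set is far more distinctive.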
en:user:jeronim
On 7/25/05, tpryor@media.mit.edu wrote:
> For this I need access to user page requests over time- preferably stored in a database. I can provide a script that will translate users' ip addresses to a unique signature so that the users themselves remain anonymous, stuff the data into a reasonably size efficient mysql table, etc.
Why would you need Wikipedia's data for the design? Couldn't you set up a wiki and use that data, or even use pseudo-random data?
> I was told that I might need to talk to Kate about the feasibility of doing this. Are there any existing objections to retaining anonymized apache log data for research purposes?
No objections to anonymized data. The trouble is that you can't really anonymize this data all that easily. Just getting a program/script to do the anonymizing would be a great project.
wikitech-l@lists.wikimedia.org