[Foundation-l] Release of squid log data

Tim Starling tstarling at wikimedia.org
Sat Sep 15 15:44:59 UTC 2007


Brian wrote:
> Although having this data is a wet dream of mine, I find it unconscionable
> to release it, and I feel that whoever was responsible for releasing it has
> already overstepped their bounds. We already know from the New York Times
> analyzing AOL's search logs that persons can be identified from search logs,
> and we know from Microsoft's Non-Disclosure Agreements with universities
> around the world for portions of the Windows 2000 source code that these
> NDAs, even to universities, are not effective in stopping the data from
> being leaked.

The data that has been released cannot be used to identify individuals.
The AOL search data could be used to identify individuals, because
searches were tagged with a pseudonymous identifier. There are no such 
identifiers in the data we are sending out.

For example, a search for a social security number, by itself, tells you 
nothing about the individual who made it. Was it the owner of the SSN, 
an employer, or someone going through the man's rubbish? Or was it a 
Wikipedian trying to determine if someone's SSN is notable enough to 
include in an article?

In the unlikely event that someone types their life story into the 
search box and clicks "go", you still don't know who wrote it, whether 
it was autobiographical, slander or fantasy.

If you see the pattern of a person's requests to Wikipedia, then you can 
infer something about them. But you can't do that with the data we are 
sending.

And finally, note that we are not releasing this data publicly, nor am
I suggesting that we should. We are not sending it to anyone who wants 
it. We are sending it to three research groups at respectable universities.

I can imagine a research group being tempted to republish a code snippet 
from Windows 2000. I find it hard to imagine that a research group would 
be tempted to mine 100 billion log lines for some tiny fragment of 
private data, and then release that data publicly or sell it to spammers.

-- Tim Starling
