On 9/14/07, Ilya Haykinson <haykinson(a)gmail.com> wrote:
On 9/14/07, Tim Starling <tstarling(a)wikimedia.org> wrote:
I wouldn't recommend using a hashed IP address to anyone involved in
academic work. I've worked in the academic sector, and I know how important
it is for data to be above any criticism. Any data using unique IP
addresses as an estimate of individual user population would be severely
skewed by proxies and NAT.
Perhaps in order to prevent potentially violating our own privacy
policy, we can meet the researchers half-way.
The best way to avoid violating the privacy policy would be to change
it to say exactly what it is you plan on doing, and to not give data
from before the policy is changed.
If we can find out the reason they need IP addresses we can craft the
data we send them to satisfy their request. For example:
a) they could just need the unique addresses to link together browsing
patterns, but not care for them to be IP addresses. We could
convert each address into a unique number (or a salted hash) and send
them the data.
In case anyone's seriously considering this, make sure you've read
[[AOL search data scandal]] which should show you why it's completely
useless. This is *especially* true with Wikipedia data, where the
urls we access constantly reveal who we are (e.g.
http://en.wikipedia.org/wiki/User_talk:Whatever).
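For reference, here is a rough Python sketch of what option a) might look like. It assumes the "salted hash" is implemented as HMAC-SHA256 with a secret key, which is my assumption and not something the proposal specifies; the function name and the 16-character truncation are also made up for illustration. Note that this only hides the address itself and does nothing about the URL problem described above.

```python
import hashlib
import hmac
import secrets


def pseudonymize_ip(ip: str, key: bytes) -> str:
    """Map an IP address to a stable pseudonym with a keyed hash (HMAC-SHA256).

    The same (ip, key) pair always gives the same pseudonym, so browsing
    patterns can still be linked together without exposing the address.
    """
    digest = hmac.new(key, ip.encode("ascii"), hashlib.sha256).hexdigest()
    return digest[:16]  # truncated for readability; an arbitrary choice


# Hypothetical secret key: generated once per data release, never published.
key = secrets.token_bytes(32)

same_1 = pseudonymize_ip("192.0.2.17", key)
same_2 = pseudonymize_ip("192.0.2.17", key)
other = pseudonymize_ip("192.0.2.18", key)
```

A keyed hash is used here instead of a plain hash with a public salt because the IPv4 space is small enough to hash exhaustively; without a secret key, anyone could rebuild the mapping.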
b) they could be looking for network topology information; we could
give them the first two or three octets of the IP address.
Three octets would be almost as bad as a) for the same reasons. Two
octets would be better, but less useful too.
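To make option b) concrete, here is a small sketch using Python's standard `ipaddress` module to keep only the leading octets of an IPv4 address. The function name `truncate_ip` and the choice to return the result in CIDR notation are assumptions made for illustration, not part of the proposal.

```python
import ipaddress


def truncate_ip(ip: str, keep_octets: int) -> str:
    """Keep only the first `keep_octets` octets of an IPv4 address.

    The remaining octets are zeroed and the result is returned as a
    network in CIDR notation.
    """
    if not 1 <= keep_octets <= 4:
        raise ValueError("keep_octets must be between 1 and 4")
    ipaddress.IPv4Address(ip)  # raises ValueError if not a valid IPv4 address
    network = ipaddress.ip_network(f"{ip}/{8 * keep_octets}", strict=False)
    return str(network)


two_octets = truncate_ip("192.0.2.17", 2)    # "192.0.0.0/16"
three_octets = truncate_ip("192.0.2.17", 3)  # "192.0.2.0/24"
```

With three octets each released value covers only 256 addresses, while two octets covers 65,536, which is roughly why the first is nearly as identifying as a full address.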
c) they could be looking for geographical distribution of queries; we
could do the geo-lookup of addresses and give them coordinate
resolution for each address instead of the address itself.
If that geo information is limited to country, I guess it wouldn't be too bad.