Hi,
You may have followed the discussion on Wikimedia-l (and enwiki-l).
Out of mere intellectual curiosity, I would like to know why hashing the IPs with a varying salt won't work.
Wouldn't that provide a way to obfuscate IP addresses while maintaining uniqueness (i.e. a given IP always gets hashed to the same hash)?
Tim said in a message on enwiki-l that he has looked into the matter but hasn't found any satisfying solution.
So what's the problem with salted hashes?
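For concreteness, here is a minimal sketch (in Python, with SHA-256 and made-up addresses of my own choosing, not anything Tim actually evaluated) of the property being asked about. Note that for "a given IP always gets hashed to the same hash" to hold, the salt must stay fixed; a salt that truly varied per request would break the uniqueness.

```python
import hashlib

def hash_ip(ip: str, salt: str) -> str:
    """Obfuscate an IP by hashing it together with a salt."""
    return hashlib.sha256((salt + ip).encode()).hexdigest()

salt = "site-wide-secret"  # hypothetical; must stay fixed for uniqueness

# The same (salt, IP) pair always yields the same digest...
assert hash_ip("203.0.113.7", salt) == hash_ip("203.0.113.7", salt)
# ...while distinct IPs yield distinct digests,
# and changing the salt changes every mapping.
assert hash_ip("203.0.113.8", salt) != hash_ip("203.0.113.7", salt)
assert hash_ip("203.0.113.7", "other-salt") != hash_ip("203.0.113.7", salt)
```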
Note: I have read something about hashing, but I am far from being an expert, so please assume I am the classical layman.
Thanks in advance to anyone who will take the time to explain.
C ---------- Forwarded message ---------- From: "Lila Tretikov" lila@wikimedia.org Date: 05/Apr/2015 11:30 Subject: Re: [Wikimedia-l] Announcing: The Wikipedia Prize! To: "Wikimedia Mailing List" wikimedia-l@lists.wikimedia.org Cc:
All,
As Tim mentioned, we are seriously looking at privacy/identity/security/anonymity issues, specifically as they pertain to IP address exposure -- from both legal and technical standpoints. This won't happen overnight as we need to get people to work on this and there are a lot of asks, but this is on our radar.
On a related note, let's skip the sarcasm and treat each other with straightforward honesty. And for non-English speakers -- who are also (if not more) in need of this -- sarcasm can be very confusing.
Thanks, Lila
On Fri, Apr 3, 2015 at 4:02 PM, Cristian Consonni kikkocristian@gmail.com wrote:
Hi Brian,
2015-03-30 0:25 GMT+02:00 Brian reflection@gmail.com:
Although the initial goal of the Netflix Prize was to design a collaborative filtering algorithm, it became notorious when the data was used to de-anonymize Netflix users. Researchers proved that given just a user's movie ratings on one site, you can plug those ratings into another site, such as the IMDB. You can then take that information, and with some Google searches and optionally a bit of cash (for websites that sell user information, including, in some cases, their SSN) figure out who they are. You could even drive up to their house and take a selfie with them, or follow them to work and meet their boss and tell them about their views on the topics they were editing.
Somewhat tangentially, and to bring this topic back to a more scientific setting, I would like to point out that there has already been research on this topic in the past.
I highly recommend reading the following paper:
Lieberman, Michael D., and Jimmy Lin. "You Are Where You Edit: Locating Wikipedia Contributors through Edit Histories." ICWSM. 2009. (PDF: http://www.pensivepuffin.com/dwmcphd/syllabi/infx598_wi12/papers/wikipedia/l... )
For those of you who don't want to read the whole paper, you can find a recap of the most relevant findings in this presentation by Maurizio Napolitano: < http://www.slideshare.net/napo/social-geography-wikipedia-a-quick-overwiew
The main idea is associating spatial coordinates with Wikipedia articles when possible; these articles are called "geopages". Then you extract from the history of the articles the users who have edited a geopage. If you plot the geopages edited by a given contributor, you can see that they tend to cluster, so you can define an "edit area". The study finds that 30-35% of contributors concentrate their edits in an edit area smaller than 1 deg^2 (~12,362 km^2, approximately the area of Connecticut or Northern Ireland[1] (thanks, Wikipedia!)).
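As a rough illustration of the clustering idea (this is my own toy sketch, not the authors' actual method, and the coordinates are invented):

```python
def edit_area_sq_deg(coords):
    """Bounding-box area, in square degrees, of a list of
    (latitude, longitude) pairs for edited geopages."""
    lats = [lat for lat, _ in coords]
    lons = [lon for _, lon in coords]
    return (max(lats) - min(lats)) * (max(lons) - min(lons))

# Invented coordinates of geopages edited by one contributor:
edits = [(45.46, 9.19), (45.48, 9.22), (45.07, 7.69)]
print(edit_area_sq_deg(edits) < 1.0)  # True: the edits cluster in < 1 deg^2
```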
For another free/libre project with a geographic focus, like OpenStreetMap, this is even more marked; check out for example the tool «"Your OSM Heat Map" (aka Where did you contribute?)»[2] by Pascal Neis.
This, of course, is not a straightforward de-anonymization, but these methods work in principle for every contributor even if you obfuscate their IP or username (provided that you can still assign all the edits from a given user to a unique and univocal identifier).
C [1] https://en.wikipedia.org/wiki/Square_degree [2a] http://yosmhm.neis-one.org/ [2b] http://neis-one.org/2011/08/yosmhm/
Wikimedia-l mailing list, guidelines at: https://meta.wikimedia.org/wiki/Mailing_lists/Guidelines Wikimedia-l@lists.wikimedia.org Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l, mailto:wikimedia-l-request@lists.wikimedia.org?subject=unsubscribe
Things that come to my mind:
* range blocks become impossible, and it's impossible to tell if vandals are using nearby IPs
* can't do a whois on the IP to see if it's a library or something
I suppose those first two come down to the drawback of not knowing the IP: you don't know the IP.
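The range-block point can be made concrete: CIDR range blocks rely on IPs in the same network sharing a common prefix, and a cryptographic hash destroys exactly that structure. A toy sketch (SHA-256 and the example addresses are my own choices):

```python
import hashlib

def hash_ip(ip: str, salt: str = "secret") -> str:
    """Salted hash of an IP address."""
    return hashlib.sha256((salt + ip).encode()).hexdigest()

a, b = "203.0.113.7", "203.0.113.200"
# Plain IPs in the same /24 share a prefix a range block can match...
print(a.rsplit(".", 1)[0] == b.rsplit(".", 1)[0])  # True
# ...but their hashes share no exploitable structure at all.
print(hash_ip(a) == hash_ip(b))  # False
```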
More importantly, as details of the mapping become public, it's hard to hide them again. IPv4 addresses are usually dynamic; eventually some people will publish their hash and IP, and then everyone knows the hash (and if you follow a specific user, you may be able to link one hash to another hash as belonging to the same ISP, and slowly puzzle things together. I imagine data-mining algorithms could be effective here, especially if you have edit history from before and after the switch). This could result in a false sense of security. Often, in privacy situations, less security is better than false security.
If people are looking into it, they probably know better than I do.
--bawolff
This has been discussed countless times. Some links for starters: https://phabricator.wikimedia.org/T20981
Nemo
2015-04-05 17:31 GMT+02:00 Brian Wolff bawolff@gmail.com:
Things that come to my mind:
* range blocks become impossible, and it's impossible to tell if vandals are using nearby IPs
* can't do a whois on the IP to see if it's a library or something
Oh, I didn't think about this!
What about:
- creating a new permission group (say "IP watchers") that can see the IP in non-hashed form?
- compiling some sort of list and automatically tagging edits from schools and libraries? (this could be useful regardless of hashing IPs)
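The second idea could be sketched like this (the ranges below are documentation addresses I made up; a real list would have to be compiled from whois data or public registries). Since the lookup can happen before hashing, it would indeed work regardless of whether the IPs are later obfuscated:

```python
import ipaddress

# Hypothetical institution ranges; real data would come from
# whois lookups or registries, not from this hard-coded dict.
TAGGED_RANGES = {
    "library": [ipaddress.ip_network("203.0.113.0/26")],
    "school": [ipaddress.ip_network("198.51.100.0/24")],
}

def tag_for_ip(ip: str):
    """Return the institution tag for an IP, or None if untagged."""
    addr = ipaddress.ip_address(ip)
    for tag, networks in TAGGED_RANGES.items():
        if any(addr in net for net in networks):
            return tag
    return None

print(tag_for_ip("198.51.100.42"))  # school
print(tag_for_ip("192.0.2.1"))      # None
```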
I suppose those first two come down to the drawback of not knowing the IP: you don't know the IP.
I think that restricting the view of IPs to users that may need them (admins, checkusers, ...) could address this.
More importantly, as details of the mapping become public, it's hard to hide them again. IPv4 addresses are usually dynamic; eventually some people will publish their hash and IP, and then everyone knows the hash (and if you follow a specific user, you may be able to link one hash to another hash as belonging to the same ISP, and slowly puzzle things together.
(This was also what I imagined as the main drawback of this algorithm.)
I imagine data-mining algorithms could be effective here, especially if you have edit history from before and after the switch). This could result in a false sense of security. Often, in privacy situations, less security is better than false security.
Yeah, as I said on Wikimedia-l[1], there are already studies that can mine data from Wikipedia and locate a user within an area (a fairly large area, but still), and this would continue to be possible.
In this light, even obfuscating IPs only for unregistered users and keeping them visible for registered users may be an idea.
Also, I don't think that hashing would provide greater security; it would probably just raise the bar a little for people wanting to locate users, but this would be a small bump in the road for an organization (say, the NSA) or an individual with enough commitment and resources.
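One concrete reason the bar is low: the IPv4 space holds only 2^32 addresses, so anyone who learns the salt can enumerate every address and invert the hashes by lookup. A toy sketch of that attack (the salt, addresses, and the small candidate list are all invented for illustration):

```python
import hashlib

def hash_ip(ip: str, salt: str) -> str:
    """Salted hash of an IP address."""
    return hashlib.sha256((salt + ip).encode()).hexdigest()

def crack(target_hash, salt, candidates):
    """Recover an IP from its salted hash by exhaustive search."""
    for ip in candidates:
        if hash_ip(ip, salt) == target_hash:
            return ip
    return None

salt = "leaked-secret"  # hypothetical: assume the salt became known
target = hash_ip("203.0.113.7", salt)

# A real attacker would iterate all ~4.3 billion IPv4 addresses;
# a tiny candidate list suffices to show the idea.
candidates = ["203.0.113.%d" % i for i in range(256)]
print(crack(target, salt, candidates))  # 203.0.113.7
```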
If people are looking into it, they probably know better than I do.
Thanks for your answer!
2015-04-05 22:04 GMT+02:00 Federico Leva (Nemo) nemowiki@gmail.com:
This has been discussed countless times. Some links for starters: https://phabricator.wikimedia.org/T20981
Well, it doesn't look like the discussion was much more developed than what bawolff said here.
C [1] https://lists.wikimedia.org/pipermail/wikimedia-l/2015-April/077404.html
Cristian Consonni wrote on 2015/04/08 at 3:00:
2015-04-05 17:31 GMT+02:00 Brian Wolff bawolff@gmail.com:
Things that come to my mind:
* range blocks become impossible, and it's impossible to tell if vandals are using nearby IPs
* can't do a whois on the IP to see if it's a library or something
Oh, I didn't think about this!
What about:
- creating a new permission group (say "IP watchers") that can see the
IP in non-hashed form?
- compile some sort of list and automatically tagging edits from
schools and libraries? (this could be useful regardless of hashing IPs)
You are solving a problem that doesn't exist and creating more serious ones. There really is no "right to privacy" as it extends to editing Wikipedia, and that the WMF has manufactured one is more a source of trouble than benefit.
As it stands, pretty much any technically literate user can look at editing histories and begin contributing by analyzing vandalism patterns and making reports and decisions about them. I rely heavily on well-formed reports of vandalism by users that have already done the preliminary grunt work of detecting similar edits from people in the same geographic region or carrier, and I don't want anything that makes it more difficult for them to do it. Vandalism and block-evasion are *real* problems. The imaginary right to carry out public actions anonymously shouldn't get in the way of solving them.
KWW
2015-04-08 16:11 GMT+02:00 Kevin Wayne Williams kwwilliams@kwwilliams.com:
What about:
- creating a new permission group (say "IP watchers") that can see the
IP in non-hashed form?
- compile some sort of list and automatically tagging edits from
schools and libraries? (this could be useful regardless of hashing IPs)
You are solving a problem that doesn't exist and creating more serious ones. There really is no "right to privacy" as it extends to editing Wikipedia, and that the WMF has manufactured one is more a source of trouble than benefit.
1) I just wanted to discuss this from a technical point of view; I am not saying that *we must* implement this.
2) Actually, this discussion started from another user on en.wiki and not from the Wikimedia Foundation, which IMHO did well in considering the problem and saying "we have looked into this but found no satisfactory solution so far".
C
On Apr 8, 2015 11:12 AM, "Kevin Wayne Williams" kwwilliams@kwwilliams.com wrote:
Systemic bias due to real-life consequences (or perceived real-life consequences) of online actions is also a real problem (or has the potential to be, anyhow; I don't know if anyone has attempted to measure that). For some people that might be fear of the NSA (or an equivalent agency/evil big government), but one doesn't have to reach for the government bogeyman to see legitimate needs for privacy - harassment campaigns by groups like Wikipediocracy demonstrate why privacy is important. Which is why we have things like logged-in users for pseudo-anonymity.
Privacy and abuse mitigation are both goods, but they are at odds. Where the appropriate balance is, is debatable, but I think everyone would agree that extremes in either direction are not good for wikis. (On one end you have Citizendium - not very much vandalism on that site, not much of anything else either. On the other end you would have what would happen if we totally eliminated users (anon and registered) and all edits were independent of each other, which sounds unworkable to me at least, but maybe extreme soft security [1] advocates would like it.)
Tl;dr: both privacy and abuse mitigation are important. Extremes in either direction would suck; it is important to discuss trade-offs and find the best balance, which might even turn out to be the status quo.
--bawolff