Agreed, on both points, but there is a big difference between "logically we can reason this is the case" and "we have proven that this is the case, and it impacts different groups in these proportions, and etc etc etc".
I see. Very true.
which means even things we all know to be true (like the mobile point) need validation.
While I do not see anything wrong with documenting and quantifying it, it is worth keeping in mind that mobile is a different case. Sharing of IPs across many users is common there because of mobile networks' use of NAT: https://en.wikipedia.org/wiki/Network_address_translation
Take a look at: http://stackoverflow.com/questions/10946624/finding-ip-address-for-iphone
On Tue, Jan 5, 2016 at 10:31 AM, Oliver Keyes okeyes@wikimedia.org wrote:
On 5 January 2016 at 13:01, Nuria Ruiz nuria@wikimedia.org wrote:
So, the goal is to have a UUID _distinct_ from IP and user agent (that is, the IP and UA are not related to the UUID that's generated) so that that UUID can be used as a baseline for accuracy purposes.
I understand. But let me re-explain: my point was that, regarding #2 (decay), we already know that the IP + UA combo in many instances decays really slowly, so the long tail is very significant, and we really do not need a token to prove this fact.
I just wanted to mention research that has already been done so you also have it as a reference and we do not duplicate work.
so much as "does a user_agent/ip hash make a good UUID, generally".
Depends on what "generally" means; in mobile the answer is most definitely no. Again, you do not need a token to prove this fact, as mobile providers sometimes use a small IP range for tens of thousands of customers.
Agreed, on both points, but there is a big difference between "logically we can reason this is the case" and "we have proven that this is the case, and it impacts different groups in these proportions, and etc etc etc". The goal is not just to provide a reference point for internal use but also to write it up for publication so it can be used more generally, which means even things we all know to be true (like the mobile point) need validation.
On Sun, Jan 3, 2016 at 9:48 AM, Oliver Keyes okeyes@wikimedia.org wrote:
Hey Nuria,
So, the goal is to have a UUID _distinct_ from IP and user agent (that is, the IP and UA are not related to the UUID that's generated) so that that UUID can be used as a baseline for accuracy purposes. Think the UUID in the ModuleStorage test datasets from wayback. So it's not "can any individual user be de-aggregated" so much as "does a user_agent/ip hash make a good UUID, generally". If I'm understanding that page correctly, it's more aimed at the former problem.
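To make the "does a user_agent/ip hash make a good UUID, generally" question concrete, here is a minimal sketch in Python. It assumes rows of (uuid, ip, user_agent) are available; the function names, the SHA-256 hash and the sample rows are illustrative assumptions, not an existing schema or dataset. It reports how often one hash covers several UUIDs and how often one UUID shows up under several hashes.

```python
import hashlib
from collections import defaultdict


def fingerprint(ip: str, user_agent: str) -> str:
    """Hash of IP + user agent, standing in for a per-user identifier."""
    return hashlib.sha256(f"{ip}|{user_agent}".encode("utf-8")).hexdigest()


def evaluate(rows):
    """rows: iterable of (uuid, ip, user_agent) tuples (hypothetical layout).

    Returns (collision_rate, fragmentation_rate):
      collision_rate     - share of fingerprints shared by more than one UUID
      fragmentation_rate - share of UUIDs seen under more than one fingerprint
    """
    uuids_per_fp = defaultdict(set)
    fps_per_uuid = defaultdict(set)
    for uuid, ip, ua in rows:
        fp = fingerprint(ip, ua)
        uuids_per_fp[fp].add(uuid)
        fps_per_uuid[uuid].add(fp)
    if not uuids_per_fp:
        return 0.0, 0.0
    collisions = sum(1 for u in uuids_per_fp.values() if len(u) > 1)
    fragmented = sum(1 for f in fps_per_uuid.values() if len(f) > 1)
    return collisions / len(uuids_per_fp), fragmented / len(fps_per_uuid)


# Made-up sample rows, for illustration only.
rows = [
    ("u1", "10.0.0.1", "UA-A"),
    ("u2", "10.0.0.1", "UA-A"),  # shares a fingerprint with u1 (e.g. behind NAT)
    ("u3", "10.0.0.2", "UA-B"),
    ("u3", "10.0.0.3", "UA-B"),  # same user, new IP
]
collision_rate, fragmentation_rate = evaluate(rows)
```

A high collision rate would suggest the hash merges distinct users, while a high fragmentation rate would suggest it splits one user into many.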
On 3 January 2016 at 11:29, Nuria nuria@wikimedia.org wrote:
Oliver,
You might want to check our documentation in wikitech regarding identity reconstruction. I think it covers your point #1.
https://wikitech.wikimedia.org/wiki/Analytics/Data/Preventing_identity_recon...
Nuria
On Jan 2, 2016, at 10:00 AM, Oliver Keyes okeyes@wikimedia.org wrote:
Hey y'all
I'm working on a piece of research (largely recreational) on the old problem of fingerprinting users with minimal information - namely the combination of a user agent and an IP address. Basically I'm looking to put together a piece of work showing:
- How sub-standard it is;
- How fast it decays;
- How the sub-standardness varies by (platform|location)
This would be pretty doable with internal data; basically I'd need a schema with IP, user agent and a per-user UUID that's got a decent (>=24 hours) expiry time. My question: does anyone know of a table with recent data that meets these requirements? And, if not, anyone with EventLogging experience interested in working on the problem with me?
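As a rough illustration of the "how fast it decays" measurement, the sketch below assumes events of (uuid, hours, ip, user_agent), where hours is a numeric timestamp in hours. The event layout, the function name and the sample data are hypothetical. For each UUID it takes the first (IP, user agent) pair seen and then, per elapsed-hour bucket, computes the share of later events that still carry that pair. The raw pair is compared instead of a hash, which is equivalent for this purpose.

```python
from collections import defaultdict


def survival_by_hour(events):
    """events: iterable of (uuid, hours, ip, user_agent) tuples.

    Returns {elapsed_hours: share of events whose (ip, user_agent)
    still matches the first pair seen for that UUID}.
    """
    first = {}                  # uuid -> (first timestamp, first (ip, ua) pair)
    matched = defaultdict(int)  # bucket -> events still matching the first pair
    total = defaultdict(int)    # bucket -> all later events in that bucket
    for uuid, t, ip, ua in sorted(events, key=lambda e: e[1]):
        if uuid not in first:
            first[uuid] = (t, (ip, ua))
            continue
        t0, pair0 = first[uuid]
        bucket = int(t - t0)    # whole hours since the first sighting
        total[bucket] += 1
        matched[bucket] += (ip, ua) == pair0
    return {b: matched[b] / total[b] for b in sorted(total)}


# Made-up sample events, for illustration only.
events = [
    ("u1", 0, "10.0.0.1", "UA-A"),
    ("u1", 1, "10.0.0.1", "UA-A"),
    ("u1", 24, "10.0.0.9", "UA-A"),  # IP changed after a day
    ("u2", 0, "10.0.0.2", "UA-B"),
    ("u2", 1, "10.0.0.7", "UA-B"),   # IP changed within the hour
    ("u2", 24, "10.0.0.2", "UA-B"),  # back on the original IP
]
survival = survival_by_hour(events)
```

Keying the buckets on (platform, elapsed_hours) instead of elapsed_hours alone would be one way to get the per-platform breakdown.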
-- Oliver Keyes Count Logula Wikimedia Foundation
Analytics mailing list Analytics@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/analytics