So, the goal is to have a UUID _distinct_ from IP and user agent (that is,
the IP and UA are not related to the UUID that's generated) so
that that UUID can be used as a baseline for accuracy purposes.
I understand, but let me re-explain. My point was that, regarding #2 (decay), we already know that the IP + UA combination in many instances decays very slowly, so the long tail is very significant, and we really do not need a token to prove this.
I just wanted to mention research that has already been done, so that you have it as a reference and we do not duplicate work.
so much as "does a user_agent/ip hash make a good UUID, generally".
It depends on what "generally" means; on mobile the answer is most definitely no. Again, you do not need a token to prove this, as mobile providers sometimes use a small IP range for tens of thousands of customers.
On Sun, Jan 3, 2016 at 9:48 AM, Oliver Keyes okeyes@wikimedia.org wrote:
Hey Nuria,
So, the goal is to have a UUID _distinct_ from IP and user agent (that is, the IP and UA are not related to the UUID that's generated) so that that UUID can be used as a baseline for accuracy purposes. Think the UUID in the ModuleStorage test datasets from wayback. So it's not "can any individual user be de-aggregated" so much as "does a user_agent/ip hash make a good UUID, generally". If I'm understanding that page correctly, it's more aimed at the former problem.
On 3 January 2016 at 11:29, Nuria nuria@wikimedia.org wrote:
Oliver,
You might want to check our documentation in wikitech regarding identity reconstruction. I think it covers your point #1.
https://wikitech.wikimedia.org/wiki/Analytics/Data/Preventing_identity_recon...
Nuria
On Jan 2, 2016, at 10:00 AM, Oliver Keyes okeyes@wikimedia.org wrote:
Hey y'all
I'm working on a piece of research (largely recreational) on the old problem of fingerprinting users with minimal information - namely the combination of a user agent and an IP address. Basically I'm looking to put together a piece of work showing:
- How sub-standard it is;
- How fast it decays;
- How the sub-standardness varies by (platform|location)
This would be pretty doable with internal data; basically I'd need a schema with IP, user agent and a per-user UUID that's got a decent (>=24 hours) expiry time. My question: does anyone know of a table with recent data that meets these requirements? And, if not, anyone with EventLogging experience interested in working on the problem with me?
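To make the comparison concrete, here is a minimal sketch of how the fingerprint could be scored against the per-user UUID. Everything in it is an assumption for illustration: the (ip, user_agent, uuid) row layout, the function names, the choice of SHA-256 and the sample rows are all hypothetical and do not describe an existing schema or table.

```python
import hashlib
from collections import defaultdict


def fingerprint(ip: str, user_agent: str) -> str:
    """Hash the IP + user agent pair into a pseudo-identifier (hypothetical scheme)."""
    return hashlib.sha256(f"{ip}|{user_agent}".encode("utf-8")).hexdigest()


def fingerprint_quality(rows):
    """Score the fingerprint against the baseline UUID.

    rows: iterable of (ip, user_agent, uuid) tuples (assumed layout).
    Returns (collision_rate, fragmentation_rate):
      collision_rate     - share of fingerprints shared by more than one UUID
      fragmentation_rate - share of UUIDs seen under more than one fingerprint
    """
    uuids_per_fp = defaultdict(set)
    fps_per_uuid = defaultdict(set)
    for ip, user_agent, uuid in rows:
        fp = fingerprint(ip, user_agent)
        uuids_per_fp[fp].add(uuid)
        fps_per_uuid[uuid].add(fp)
    if not uuids_per_fp:
        return 0.0, 0.0
    collisions = sum(1 for uuids in uuids_per_fp.values() if len(uuids) > 1)
    fragmented = sum(1 for fps in fps_per_uuid.values() if len(fps) > 1)
    return collisions / len(uuids_per_fp), fragmented / len(fps_per_uuid)


# Made-up sample rows, purely for illustration.
rows = [
    ("10.0.0.1", "Firefox/43.0", "user-a"),
    ("10.0.0.1", "Firefox/43.0", "user-b"),  # two users behind one IP, same browser
    ("10.0.0.2", "Chrome/47.0", "user-c"),
    ("10.0.0.3", "Chrome/47.0", "user-c"),   # one user whose IP changed
]
collision_rate, fragmentation_rate = fingerprint_quality(rows)
```

Running the same function over successive time windows, or per platform and location, would give a rough view of decay and of how the quality varies, again assuming the data carries a timestamp and those dimensions.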
--
Oliver Keyes
Count Logula
Wikimedia Foundation

Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics