On 5 January 2016 at 13:01, Nuria Ruiz nuria@wikimedia.org wrote:
So, the goal is to have a UUID _distinct_ from IP and user agent (that is, the IP and UA are not related to the UUID that's generated) so that that UUID can be used as a baseline for accuracy purposes.
I understand. But let me re-explain: my point regarding #2 (decay) is that we already know the IP + UA combo in many instances decays really slowly, so the long tail is very significant, and we really do not need a token to prove this fact.
I just wanted to mention research that has already been done so you have it also as a reference and we do not duplicate work.
so much as "does a user_agent/ip hash make a good UUID, generally".
Depends on what "generally" means; on mobile the answer is most definitely no. Again, you do not need a token to prove this fact, as mobile providers sometimes use a small IP range for tens of thousands of customers.
Agreed on both points, but there is a big difference between "logically we can reason this is the case" and "we have proven that this is the case, and it impacts different groups in these proportions", and so on. The goal is not just to provide a reference point for internal use but also to write it up for publication so it can be used more generally, which means even things we all know to be true (like the mobile point) need validation.
On Sun, Jan 3, 2016 at 9:48 AM, Oliver Keyes okeyes@wikimedia.org wrote:
Hey Nuria,
So, the goal is to have a UUID _distinct_ from IP and user agent (that is, the IP and UA are not related to the UUID that's generated) so that that UUID can be used as a baseline for accuracy purposes. Think of the UUID in the ModuleStorage test datasets from way back. So it's not "can any individual user be de-aggregated" so much as "does a user_agent/ip hash make a good UUID, generally". If I'm understanding that page correctly, it's more aimed at the former problem.
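As a rough illustration of the "does a user_agent/ip hash make a good UUID" question, here is a minimal Python sketch that compares a hashed IP + UA fingerprint against an independent UUID baseline. Everything in it is an assumption made for illustration: the record layout `(ip, user_agent, uuid)`, the function names, the use of SHA-256, and the two metrics. It is not an existing schema or the method used in this research.

```python
import hashlib
from collections import defaultdict


def fingerprint(ip, user_agent):
    """Hypothetical fingerprint: SHA-256 over the IP and user agent."""
    return hashlib.sha256(f"{ip}|{user_agent}".encode("utf-8")).hexdigest()


def fingerprint_quality(records):
    """Compare fingerprints against the UUID baseline.

    records: iterable of (ip, user_agent, uuid) tuples (assumed layout).
    collision_rate: share of fingerprints shared by more than one UUID.
    fragmentation_rate: share of UUIDs seen under more than one fingerprint.
    """
    uuids_per_fp = defaultdict(set)
    fps_per_uuid = defaultdict(set)
    for ip, user_agent, uuid in records:
        fp = fingerprint(ip, user_agent)
        uuids_per_fp[fp].add(uuid)
        fps_per_uuid[uuid].add(fp)

    n_fp = len(uuids_per_fp)
    n_uuid = len(fps_per_uuid)
    collided = sum(1 for uuids in uuids_per_fp.values() if len(uuids) > 1)
    fragmented = sum(1 for fps in fps_per_uuid.values() if len(fps) > 1)
    return {
        "fingerprints": n_fp,
        "uuids": n_uuid,
        "collision_rate": collided / n_fp if n_fp else 0.0,
        "fragmentation_rate": fragmented / n_uuid if n_uuid else 0.0,
    }
```

Running the same function separately on, say, mobile and desktop records would give the kind of per-group proportions discussed in this thread; how records are split into groups is left open here.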
On 3 January 2016 at 11:29, Nuria nuria@wikimedia.org wrote:
Oliver,
You might want to check our documentation in wikitech regarding identity reconstruction. I think it covers your point #1.
https://wikitech.wikimedia.org/wiki/Analytics/Data/Preventing_identity_recon...
Nuria
On Jan 2, 2016, at 10:00 AM, Oliver Keyes okeyes@wikimedia.org wrote:
Hey y'all
I'm working on a piece of research (largely recreational) on the old problem of fingerprinting users with minimal information - namely the combination of a user agent and an IP address. Basically I'm looking to put together a piece of work showing:
- How sub-standard it is;
- How fast it decays;
- How the sub-standardness varies by (platform|location)
This would be pretty doable with internal data; basically I'd need a schema with IP, user agent and a per-user UUID that's got a decent (>=24 hours) expiry time. My question: does anyone know of a table with recent data that meets these requirements? And, if not, anyone with EventLogging experience interested in working on the problem with me?
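To make the decay question concrete, here is a minimal Python sketch of how it could be measured given such a schema. The event layout `(uuid, timestamp_seconds, ip, user_agent)` and both function names are hypothetical, and the raw `(ip, user_agent)` pair stands in for the hash, since only equality matters here. UUIDs that were not observed for long enough to judge are excluded rather than counted as survivors.

```python
from collections import defaultdict


def fingerprint_lifetimes(events):
    """events: iterable of (uuid, timestamp_seconds, ip, user_agent).

    Returns {uuid: (change_after, observed_span)}, where change_after is
    the number of seconds until the first event whose (ip, user_agent)
    differs from the UUID's first one, or None if it never changed.
    """
    by_uuid = defaultdict(list)
    for uuid, ts, ip, user_agent in events:
        by_uuid[uuid].append((ts, (ip, user_agent)))

    lifetimes = {}
    for uuid, rows in by_uuid.items():
        rows.sort(key=lambda row: row[0])
        first_ts, first_fp = rows[0]
        change_after = next(
            (ts - first_ts for ts, fp in rows if fp != first_fp), None
        )
        lifetimes[uuid] = (change_after, rows[-1][0] - first_ts)
    return lifetimes


def survival_at(lifetimes, horizon):
    """Share of UUIDs whose first fingerprint is still unchanged at `horizon`.

    UUIDs with no change and an observed span shorter than the horizon
    are treated as censored and left out. Returns None if nothing is left.
    """
    survived = decayed = 0
    for change_after, span in lifetimes.values():
        if change_after is not None and change_after <= horizon:
            decayed += 1
        elif span >= horizon:
            survived += 1
    total = survived + decayed
    return survived / total if total else None
```

Calling `survival_at(lifetimes, 24 * 3600)` would give the share of fingerprints still intact after a day, which is why the UUID needs an expiry of at least that long.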
-- Oliver Keyes Count Logula Wikimedia Foundation
Analytics mailing list Analytics@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/analytics