Hey y'all
I'm working on a piece of research (largely recreational) on the old problem of fingerprinting users with minimal information - namely the combination of a user agent and an IP address. Basically I'm looking to put together a piece of work showing:
1. How sub-standard it is;
2. How fast it decays;
3. How the sub-standardness varies by (platform|location)
This would be pretty doable with internal data; basically I'd need a schema with IP, user agent and a per-user UUID that's got a decent (>=24 hours) expiry time. My question: does anyone know of a table with recent data that meets these requirements? And, if not, anyone with EventLogging experience interested in working on the problem with me?
Per https://phabricator.wikimedia.org/T119144 , you are probably out of luck, as it seems there is basically no current EventLogging table with valid IPs (IP hashes) ...
Disregarding that, you could take a look at MobileWebSectionUsage or MobileWebUIClickTracking.
On Sat, Jan 2, 2016 at 10:00 AM, Oliver Keyes okeyes@wikimedia.org wrote:
Hey y'all
I'm working on a piece of research (largely recreational) on the old problem of fingerprinting users with minimal information - namely the combination of a user agent and an IP address. Basically I'm looking to put together a piece of work showing:
- How sub-standard it is;
- How fast it decays;
- How the sub-standardness varies by (platform|location)
This would be pretty doable with internal data; basically I'd need a schema with IP, user agent and a per-user UUID that's got a decent (>=24 hours) expiry time. My question: does anyone know of a table with recent data that meets these requirements? And, if not, anyone with EventLogging experience interested in working on the problem with me?
-- Oliver Keyes Count Logula Wikimedia Foundation
Analytics mailing list Analytics@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/analytics
So we've just had invalid IPs...for how long? And how long ago was this reported without being fixed?
Awesome work surfacing the bug. But the fact that this has not been fixed so far and, moreover, that nobody on Analytics Engineering let consumers know (unless there's a thread I've missed somewhere) is deeply concerning. We have schemas and analyses that rely on this field. As a customer I would like to know what the schedule is for fixing this bug.
On 2 January 2016 at 20:47, Tilman Bayer tbayer@wikimedia.org wrote:
Per https://phabricator.wikimedia.org/T119144 , you are probably out of luck, as it seems there is basically no current EventLogging table with valid IPs (IP hashes) ...
Disregarding that, you could take a look at MobileWebSectionUsage or MobileWebUIClickTracking.
-- Tilman Bayer Senior Analyst Wikimedia Foundation IRC (Freenode): HaeB
On Sun, Jan 3, 2016 at 7:16 AM, Oliver Keyes okeyes@wikimedia.org wrote:
So we've just had invalid IPs...for how long?
See the discussion in the Phabricator task: at least since June 2015 (the time of the HTTPS-only rollout).
Oliver,
You might want to check our documentation in wikitech regarding identity reconstruction. I think it covers your point #1.
https://wikitech.wikimedia.org/wiki/Analytics/Data/Preventing_identity_recon...
Nuria
Hey Nuria,
So, the goal is to have a UUID _distinct_ from IP and user agent (that is, the IP and UA are not related to the UUID that's generated) so that that UUID can be used as a baseline for accuracy purposes. Think the UUID in the ModuleStorage test datasets from way back. So it's not "can any individual user be de-aggregated" so much as "does a user_agent/ip hash make a good UUID, generally". If I'm understanding that page correctly, it's more aimed at the former problem.
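To make the baseline comparison concrete, here is a minimal sketch of how a user_agent/ip hash could be scored against an independent token. Everything in it is an assumption for illustration: the (token, ip, user_agent) tuple layout, the helper names `fingerprint` and `evaluate`, and the choice of SHA-256 are hypothetical and do not describe any existing EventLogging schema.

```python
from collections import defaultdict
import hashlib


def fingerprint(ip: str, user_agent: str) -> str:
    """Hash the (IP, user agent) pair into a candidate identifier."""
    return hashlib.sha256(f"{ip}|{user_agent}".encode("utf-8")).hexdigest()


def evaluate(events):
    """Score the fingerprint against a ground-truth token.

    events: iterable of (token, ip, user_agent) tuples (hypothetical layout).
    Returns (collision_rate, fragmentation_rate):
      collision_rate     = share of fingerprints shared by more than one token
      fragmentation_rate = share of tokens seen under more than one fingerprint
    """
    tokens_per_fp = defaultdict(set)
    fps_per_token = defaultdict(set)
    for token, ip, user_agent in events:
        fp = fingerprint(ip, user_agent)
        tokens_per_fp[fp].add(token)
        fps_per_token[token].add(fp)
    if not tokens_per_fp:
        return 0.0, 0.0
    collisions = sum(1 for tokens in tokens_per_fp.values() if len(tokens) > 1)
    fragmented = sum(1 for fps in fps_per_token.values() if len(fps) > 1)
    return collisions / len(tokens_per_fp), fragmented / len(fps_per_token)


# Made-up sample data, purely illustrative.
sample = [
    ("u1", "10.0.0.1", "Firefox"),
    ("u2", "10.0.0.1", "Firefox"),  # two users behind one IP/UA: a collision
    ("u3", "10.0.0.2", "Chrome"),
    ("u3", "10.0.0.3", "Chrome"),   # one user on two IPs: fragmentation
]
collision_rate, fragmentation_rate = evaluate(sample)
```

Collisions capture distinct users who look identical under the hash, and fragmentation captures single users who split into several hashes; both would need to be low for the hash to stand in for a UUID.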
On 3 January 2016 at 11:29, Nuria nuria@wikimedia.org wrote:
Oliver,
You might want to check our documentation in wikitech regarding identity reconstruction. I think it covers your point #1.
https://wikitech.wikimedia.org/wiki/Analytics/Data/Preventing_identity_recon...
Nuria
I think it wasn't fixed because no one was complaining, which was used to back the argument that we should remove clientIp altogether.
On Jan 3, 2016, at 09:48, Oliver Keyes okeyes@wikimedia.org wrote:
Hey Nuria,
So, the goal is to have a UUID _distinct_ from IP and user agent (that is, the IP and UA are not related to the UUID that's generated) so that that UUID can be used as a baseline for accuracy purposes. Think the UUID in the ModuleStorage test datasets from wayback. So it's not "can any individual user be de-aggregated" so much as "does a user_agent/ip hash make a good UUID, generally". If I'm understanding that page correctly, it's more aimed at the former problem.
So, the goal is to have a UUID _distinct_ from IP and user agent (that is, the IP and UA are not related to the UUID that's generated) so that that UUID can be used as a baseline for accuracy purposes.
I understand. But let me re-explain: my point was that, regarding #2 (decay), we already know that the IP + UA combo in many instances decays really slowly, so the long tail is very significant, and we really do not need a token to prove this fact.
I just wanted to mention research that has already been done so you have it also as a reference and we do not duplicate work.
so much as "does a user_agent/ip hash make a good UUID, generally".
Depends on what "generally" means; on mobile the answer is most definitely no. Again, you do not need a token to prove this fact, as mobile providers sometimes use a small IP range for tens of thousands of customers.
On 5 January 2016 at 13:01, Nuria Ruiz nuria@wikimedia.org wrote:
So, the goal is to have a UUID _distinct_ from IP and user agent (that is, the IP and UA are not related to the UUID that's generated) so that that UUID can be used as a baseline for accuracy purposes.
I understand. But let me re-explain: my point was mentioning that regarding #2 (decay) we already know that the IP + UA combo in many instances decays real slowly, so the long tail is very significant and that we really do not need a token to prove this fact.
I just wanted to mention research that has already been done so you have it also as a reference and we do not duplicate work.
so much as "does a user_agent/ip hash make a good UUID, generally".
Depends on what "generally" means, in mobile the answer is most definitely no. Again, you do not need a token to prove this fact, as mobile providers use sometimes a short IP range for tens of thousands of customers.
Agreed, on both points, but there is a big difference between "logically we can reason this is the case" and "we have proven that this is the case, and it impacts different groups in these proportions, and etc etc etc". The goal is not just to provide a reference point for internal use but also to write it up for publication so it can be used more generally, which means even things we all know to be true (like the mobile point) need validation.
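As a rough illustration of what validating the decay point could look like, the sketch below measures, per day since a token was first seen, the share of events whose fingerprint still equals the first one observed for that token. The (token, timestamp, fingerprint) layout and the function name `retention_by_day` are hypothetical, and this is only one of several reasonable ways to define decay.

```python
from collections import defaultdict

SECONDS_PER_DAY = 86400


def retention_by_day(events):
    """Share of events per day whose fingerprint matches the token's first one.

    events: iterable of (token, timestamp_seconds, fingerprint) tuples
    (hypothetical layout). Day 0 is the day a token is first seen.
    """
    first_seen = {}
    matches = defaultdict(int)
    totals = defaultdict(int)
    for token, ts, fp in sorted(events, key=lambda e: e[1]):
        if token not in first_seen:
            first_seen[token] = (ts, fp)
        start_ts, start_fp = first_seen[token]
        day = int((ts - start_ts) // SECONDS_PER_DAY)
        totals[day] += 1
        matches[day] += int(fp == start_fp)
    return {day: matches[day] / totals[day] for day in sorted(totals)}


# Made-up sample data, purely illustrative.
sample = [
    ("u1", 0, "a"),
    ("u1", 90000, "a"),   # same fingerprint a day later
    ("u2", 0, "b"),
    ("u2", 100000, "c"),  # fingerprint changed a day later
]
retention = retention_by_day(sample)
```

Running the same function separately on subsets of the data (for example per platform or per country) would give the per-group proportions mentioned above.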
Agreed, on both points, but there is a big difference between "logically we can reason this is the case" and "we have proven that this is the case, and it impacts different groups in these proportions, and etc etc etc".
I see. Very true.
which means even things we all know to be true (like the mobile point) need validation.
While I do not see anything wrong with documenting and quantifying it, it is worth keeping in mind that mobile is a different case. The sharing of IPs across many users is common due to mobile networks' use of NAT: https://en.wikipedia.org/wiki/Network_address_translation
Take a look at: http://stackoverflow.com/questions/10946624/finding-ip-address-for-iphone
On Tue, Jan 5, 2016 at 10:31 AM, Oliver Keyes okeyes@wikimedia.org wrote:
Agreed, on both points, but there is a big difference between "logically we can reason this is the case" and "we have proven that this is the case, and it impacts different groups in these proportions, and etc etc etc". The goal is not just to provide a reference point for internal use but also to write it up for publication so it can be used more generally, which means even things we all know to be true (like the mobile point) need validation.