I’m sharing a proposal that Reid Priedhorsky and his collaborators at Los Alamos National Laboratory recently submitted to the Wikimedia Analytics Team aimed at producing privacy-preserving geo-aggregates of Wikipedia pageview data dumps and making them available to the public and the research community. [1]
Reid and his team spearheaded the use of the public Wikipedia pageview dumps to monitor and forecast the spread of influenza and other diseases, using language as a proxy for location. This proposal describes an aggregation strategy adding a geographical dimension to the existing dumps.
Feedback on the proposal is welcome on the lists or on the project talk page on Meta. [3]
Dario
[1] https://meta.wikimedia.org/wiki/Research:Geo-aggregation_of_Wikipedia_pagevi...
[2] http://dx.doi.org/10.1371/journal.pcbi.1003892
[3] https://meta.wikimedia.org/wiki/Research_talk:Geo-aggregation_of_Wikipedia_p...
Hi Dario, Reid,
This seems sensible enough and proposal #3 is clearly the better approach. An explicit opt-in/opt-out mechanism would not be worth the effort to build and would become yet another ignored preferences setting after a few weeks...
A couple of thoughts:
* I understand the reasoning for not using do-not-track headers (#4); however, it feels a bit odd to say "they probably don't mean us" and skip them... I can almost guarantee you'll have at least one person making a vocal fuss about not being able to opt out without an account. If we were to honour these headers, would it make a significant change to the amount of data available? Would it likely skew it any more than leaving off logged-in users?
* Option 3 does release one further piece of information over and above those listed: an approximate ratio of logged-in versus non-logged-in pageviews for a page. I cannot see any particular problem with doing this (and I can think of a couple of fun things to use it for) but it's probably worth being aware.
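One way to put numbers on the first question is to compare how much of a request sample each exclusion rule would drop. The following is a minimal sketch only: the `Request` record, the `excluded_share` helper and the sample counts are all hypothetical and made up for illustration; nothing here reflects real Wikimedia traffic or the proposal's actual pipeline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    """Hypothetical request record; both fields are assumptions."""
    dnt: bool        # request carried the header "DNT: 1"
    logged_in: bool  # request came from a logged-in user


def excluded_share(requests, rule):
    """Fraction of requests that `rule` would drop from the dataset."""
    if not requests:
        return 0.0
    return sum(1 for r in requests if rule(r)) / len(requests)


# Made-up sample of 100 requests, for illustration only.
sample = (
    [Request(dnt=True, logged_in=False)] * 8
    + [Request(dnt=False, logged_in=True)] * 5
    + [Request(dnt=True, logged_in=True)] * 2
    + [Request(dnt=False, logged_in=False)] * 85
)

# Share lost when only logged-in users are left out.
share_logged_in = excluded_share(sample, lambda r: r.logged_in)
# Share lost when DNT requests are left out as well.
share_either = excluded_share(sample, lambda r: r.logged_in or r.dnt)
# Additional loss attributable to honouring DNT.
extra_loss_from_dnt = share_either - share_logged_in
```

Running the same comparison per page or per region, rather than over the whole sample, would be the way to look for skew instead of overall loss.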
Andrew.
On 13 January 2015 at 07:26, Dario Taraborelli dtaraborelli@wikimedia.org wrote:
[...]
Analytics mailing list Analytics@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/analytics
Andrew,
I think it is reasonable to assume that the "Do not track" header isn't referring to this.
From http://donottrack.us/ with emphasis added.
Do Not Track is a technology and policy proposal that enables users to opt out of *tracking by websites they do not visit*, [...]
Do not track is explicitly for third party tracking. We are merely proposing to count those people who do access our sites. Note that, in this case, we are not interested in obtaining identifiers at all, so the word "track" seems to not apply.
It seems like we're looking for something like a "Do Not Log Anything At All" header. I don't believe that such a thing exists -- but if it did I think it would be good if we supported it.
-Aaron
On Tue, Jan 13, 2015 at 2:03 PM, Andrew Gray andrew.gray@dunelm.org.uk wrote:
[...]
Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
+1 Aaron
On Tue, Jan 13, 2015 at 3:24 PM, Aaron Halfaker ahalfaker@wikimedia.org wrote:
[...]
Hi,
On Tue, Jan 13, 2015 at 02:24:02PM -0600, Aaron Halfaker wrote:
Do Not Track is a technology and policy proposal that enables users to opt out of *tracking by websites they do not visit*, [...]
Do not track is explicitly for third party tracking. We are merely proposing to count those people who do access our sites.
The first/third party distinction and exemptions are clear-cut in technical documents (although along different lines in different commentaries). However, from my point of view, this distinction ignores real-life users.
I for one don't want to spend half an hour to figure out which parts of a page are first/third party. I'd just expect the gathering/using of data to stop altogether.
And according to [1], I am not the only user who feels this way:
Preliminary results suggest that users do not share nearly so nuanced view of tracking, but rather simply expect data collection and use to cease when they click a Do Not Track button.
One can always do better than the minimum requirements of a standard. For DNT, one can always choose to interpret it in a more restrictive way and thereby move closer to the expectation of the users of the above study.
Have fun, Christian
[1] A. M. McDonald and J. M. Peha, "Track Gap: Policy Implications of User Expectations for the `Do Not Track' Internet Privacy Feature," 39th Telecommunications Policy Research Conference (TPRC), 2011.
I think that the conclusion that you draw from that study is sketchy. They're really only asking what people think of when they read the words "Do Not Track". I'd be more interested in knowing what people expect when they look at their particular browser setting and what it is they actually hope it will accomplish. This naivety seems to come through clearly in the results. The plurality thought it had nothing to do with their relationship with the site they were visiting at all.
The most frequent answer (33%) was that Do Not Track would affect their Internet history. For example, one participant wrote, “It would stop my browser from tracking my browsing history”
Regardless of how people interpret the words "Do", "Not" and "Track", I see a clear use case for requesting that activities not be used to track me between websites. It seems like that was what Do Not Track was designed to do.
However, I also see a clear use-case for when I would like to not be tracked at all. I'd advocate for a "Do Not Log Anything At All" header that would allow us to respect such a preference.
Really, I don't see good reason to jam one use case into something it so apparently wasn't designed for. We'd be making some bold and wasteful assumptions on behalf of our users.
-Aaron
On Tue, Jan 13, 2015 at 4:04 PM, Christian Aistleitner <christian@quelltextlich.at> wrote:
[...]
> However, I also see a clear use-case for when I would like to not be tracked at all. I'd advocate for a "Do Not Log Anything At All" header that would allow us to respect such a preference.
I very much agree with Christian that using "do not track" for a total opt-out is a good use of the header. Implementing another one seems like overkill, and I doubt we are going to go that route code-wise, all the more so since do not track is available on the JavaScript navigator object: https://developer.mozilla.org/en-US/docs/Web/API/navigator.doNotTrack
FYI, we have WIP changes to honor the do not track header in event logging. We should be deploying those in the near future.
> We'd be making some bold and wasteful assumptions on behalf of our users.
Giving users the ability to turn off all tracking by using a header called "do not track" is pretty intuitive. Assuming that many of our users equate "do not track" with "do not send my data" is not assuming too much; "third party tracking" is a pretty technical concept.
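For illustration, treating the header as a total opt-out on the server side could look roughly like the sketch below. It is not the WIP event logging change mentioned above, which is not shown in this thread; `dnt_enabled`, `maybe_log` and the in-memory `sink` are hypothetical names. The only protocol fact it relies on is that a client expresses the preference by sending the request header `DNT: 1`.

```python
from typing import Mapping


def dnt_enabled(headers: Mapping[str, str]) -> bool:
    """Return True if the request carries "DNT: 1".

    HTTP header names are case-insensitive, so the name is compared
    in lower case. A missing header or any other value is treated as
    "no opt-out expressed".
    """
    for name, value in headers.items():
        if name.lower() == "dnt":
            return value.strip() == "1"
    return False


def maybe_log(event: dict, headers: Mapping[str, str], sink: list) -> bool:
    """Append `event` to `sink` unless the client sent DNT: 1.

    Returns True if the event was logged, False if it was dropped.
    `sink` stands in for whatever logging backend is actually used.
    """
    if dnt_enabled(headers):
        return False
    sink.append(event)
    return True


# Usage: one request with the header set, one without.
log: list = []
dropped = maybe_log({"page": "Influenza"}, {"DNT": "1"}, log)
kept = maybe_log({"page": "Influenza"}, {"User-Agent": "example"}, log)
```

Dropping the event before it reaches any storage, rather than filtering it out later, is the design choice that matches the "data collection should cease" expectation discussed above.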
On Tue, Jan 13, 2015 at 3:17 PM, Aaron Halfaker ahalfaker@wikimedia.org wrote:
[...]
Sorry if this has already been answered, but do we know how many people have DNT set?
On Tuesday, January 13, 2015, Nuria Ruiz nuria@wikimedia.org wrote:
[...]
On Tue, Jan 13, 2015 at 6:08 PM, Dan Andreescu dandreescu@wikimedia.org wrote:
Sorry if this has already been answered, but do we know how many people have DNT set?
No, and there is a logistical problem that stands in the way of finding that out.. :)
On Wed, Jan 14, 2015 at 9:42 AM, Ori Livneh ori@wikimedia.org wrote:
No, and there is a logistical problem that stands in the way of finding that out.. :)
But more generally: https://dnt-dashboard.mozilla.org/
Mozilla reports that a single-digit percentage of their users have it turned on.
Luis
On Tue, Jan 13, 2015 at 3:17 PM, Aaron Halfaker ahalfaker@wikimedia.org wrote:
They're really only asking what people think of when they read the words "Do Not Track". I'd be more interested in knowing what people expect when then look at their particular browser setting and what it is they actually hope it will accomplish.
While it's true that there is ambiguity about what users are objecting to when they turn on DNT (3rd party tracking? behavioral tracking? all data collection?), the costs of getting it wrong are not symmetrical. If I object to all forms of data collection, and you collect data about me anyway, I'd be pretty upset. But if I'm OK with certain forms of data collection, and you decline to collect data about me.. meh.
Ori, I don't think you addressed the point I made about that study. They didn't ask users what they thought *their* browser setting meant and what they expected. They asked what they thought a big red button with "DO NOT TRACK" on it meant -- and the most common answer had to do with their local browser history!
Regardless, I think you make a good point. The cost of getting something wrong here may not be symmetrical, but it's not clear to me that erring on the side of collecting absolutely no data is less costly.
For example, not collecting usage data about certain sections of our population (e.g. IE10 users where DNT is set by default) means that we don't know if our software works for them. This isn't free, and in the long term, it can have substantial negative effects. If DNT were always disabled by default in major browsers, I would expect such biases to be minimal.
Also, I think that if a user sets DNT and expects it to do something it isn't supposed to do, we can always point them to the spec. It's a sad fact that, if you want to remain private on the web, you're going to need to inform yourself about how such things work. Just because we adopt an extreme/overly simplistic interpretation doesn't mean that the people you really don't want to have your behavioral data will too -- but it certainly has the potential to make research & product's job much more difficult.
Really, what I'm trying to say is that if I "decline to collect data about [you]", you shouldn't say, "meh". You should be concerned about how we're not considering what works and does not work for people like you when we design, test and deploy software changes. In a way, it's like taking away your vote. And if you don't believe that, I'd like to suggest that the only alternative is that the work that I do does not bring value to our users -- and I'd beg to differ.
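The bias in the IE10 example can be made concrete by measuring, per browser, what fraction of requests would survive a "drop everything with DNT: 1" rule. This is a rough sketch under stated assumptions: `coverage_by_browser` is a hypothetical helper, and the sample counts are invented to mimic a browser that ships with DNT on by default; they are not measurements.

```python
from collections import Counter


def coverage_by_browser(requests):
    """Map browser -> fraction of its requests kept after dropping DNT: 1.

    `requests` is an iterable of (browser, dnt) pairs, where `dnt` is
    True when the request carried "DNT: 1".
    """
    total = Counter()
    kept = Counter()
    for browser, dnt in requests:
        total[browser] += 1
        if not dnt:
            kept[browser] += 1
    return {browser: kept[browser] / total[browser] for browser in total}


# Invented numbers: a browser with DNT on by default versus one
# where users have to switch it on themselves.
sample = (
    [("IE10", True)] * 9 + [("IE10", False)] * 1
    + [("Firefox", True)] * 1 + [("Firefox", False)] * 9
)
coverage = coverage_by_browser(sample)
```

A large gap between browsers in such a table would indicate that the remaining data under-represents whole user groups, rather than only individuals who chose to opt out.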
-Aaron
On Wed, Jan 14, 2015 at 11:40 AM, Ori Livneh ori@wikimedia.org wrote:
[...]
> For example, not collecting usage data about certain sections of our population (e.g. IE10 users where DNT is set by default) means that we don't know if our software works for them. This isn't free, and in the long-term, it can have substantial negative effects. If DNT was always disabled by default in major browsers, I would expect such biases to be minimal.
IE's faulty support, downright wrong support or lack of support for many of the web APIs is no news to anyone who has done web development in the last 10 years, and nothing to write your mom about, really.
IE is treated specially in many areas, and we might do so in this one too if it turns out that:
- No service pack install has corrected the DNT default (sounds like no, this did not happen)
- IE10 traffic is significant. I will get those numbers, since I last checked browser stats more than 6 months ago and things might have changed significantly. Last time I checked, I *believe* (going from memory) we had quite a bit less traffic from IE10 than from IE8.
Thanks,
Nuria
On Wed, Jan 14, 2015 at 10:07 AM, Aaron Halfaker ahalfaker@wikimedia.org wrote:
[...]
And, IE11? 12? My point is that yes, we can go about writing a lot of exceptions for specific use cases, and coming up with solutions for each browser's DNT idiosyncrasies, but the costs of that trade-off increase the more we have to support.
I'd much rather we built a uniform system that asked users to explicitly opt-out, and made clear what they were opting out of: it's quite clear from both the public and private discussions around DNT that there is a big detachment between user expectations of DNT and what the protocol actually does, and so we should probably avoid treating that protocol as a flag.
On 14 January 2015 at 13:45, Nuria Ruiz nuria@wikimedia.org wrote:
For example, not collecting usage data about certain sections of our population (e.g. IE10 users where DNT is set by default) >means that we don't know if our software works for them. This isn't free, and in the long-term, it can have substantial negative >effects. If DNT was always disabled by default in major browsers, I would expect such biases to be minimal.
IE's faulty support, downright wrong support, or lack of support for many web APIs is no news to anyone who has done web development in the last 10 years, and nothing to write home to mom about, really.
IE is treated specially in many areas, and we might do so in this one too if it turns out that:
- No service pack install has corrected the DNT default (sounds like no, this did not happen)
- IE10 traffic is significant. I will get those numbers, as I last checked browser stats more than 6 months ago and things might have changed significantly. Last time I checked I *believe* (going from memory) we had quite a bit less traffic from IE10 than IE8.
Thanks,
Nuria
On Wed, Jan 14, 2015 at 10:07 AM, Aaron Halfaker ahalfaker@wikimedia.org wrote:
Ori, I don't think you addressed the point I made about that study. They didn't ask users what they thought *their* browser setting meant and what they expected. They asked what they thought a big red button with "DO NOT TRACK" on it meant -- and the most common answer had to do with their local browser history!
Regardless, I think you make a good point. The cost of getting something wrong here may not be symmetrical, but it's not clear to me that erring on the side of collecting absolutely no data is less costly.
For example, not collecting usage data about certain sections of our population (e.g. IE10 users where DNT is set by default) means that we don't know if our software works for them. This isn't free, and in the long-term, it can have substantial negative effects. If DNT was always disabled by default in major browsers, I would expect such biases to be minimal.
Also, I think that if a user sets DNT and expects it to do something it isn't supposed to do, we can always point them to the spec. It's a sad fact that, if you want to remain private on the web, you're going to need to inform yourself about how such things work. Just because we adopt an extreme/overly simplistic policy doesn't mean that the people you really don't want to have your behavioral data will too -- but it certainly has the potential to make research & product's job much more difficult.
Really, what I'm trying to say is that if I "decline to collect data about [you]", you shouldn't say, "meh". You should be concerned about how we're not considering what works and does not work for people like you when we design, test and deploy software changes. In a way, it's like taking away your vote. And if you don't believe that, I'd like to suggest that the only alternative is that the work that I do does not bring value to our users -- and I'd beg to differ.
-Aaron
And, IE11?
From what I see, it is not subject to the same restrictions as IE10. Note that issues with older browsers fade away for browsers that auto-update.
On Wed, Jan 14, 2015 at 11:18 AM, Oliver Keyes okeyes@wikimedia.org wrote:
it's quite clear from both the public and private discussions around DNT that there is a big detachment between user expectations of DNT and what the protocol actually does, and so we should probably avoid treating that protocol as a flag.
On a less technical, more philosophical note, I think that there is nothing preventing us from taking a strong stand and saying "do not track" equals "do not collect".
The EFF on this topic:
"Intuitively, users who we've talked to want Do Not Track to provide meaningful limits on collection and retention of data. From the user's perspective, sending the DNT browser signal to websites should indicate: don't keep any records of my information, and collect the *bare minimum* amount of information required to provide me with the service that you are offering."
Excellent graphical representation of the user expectations vs. what is going on at the W3C standards group: https://www.eff.org/files/images_insert/dnt_chart_0.jpg
Agreed! There's no philosophical blocker. In a universe in which DNT was uniformly treated, and uniformly opt-in, without substantial variations in status between demographics, I would have absolutely no problem with equating the two. As a user, I until very recently assumed DNT == DNC.
Unfortunately we do not live in that universe. If we want to transition to it, relying on DNT will not allow us to strike a balance between research and privacy that doesn't totally tank one of the two.
Hi,
On Wed, Jan 14, 2015 at 12:07:57PM -0600, Aaron Halfaker wrote:
For example, not collecting usage data about certain sections of our population (e.g. IE10 users where DNT is set by default) means that we don't know if our software works for them.
If WMF's main form of QA was through automated usage data collection, you'd have a point.
But actually, I think WMF is doing better than that.
From my point of view, a central pillar in QA is “software getting tested”. That's happening widely across WMF. Both manually and automated. It's great already and getting better every day.
And for me the main QA ingredient is listening to feedback from the users. Besides studies and dog-fooding, WMF's bugtracker is a testament to that and contains reports that “$X is not working on browser $Y” or “$X needs to also do $Z”. And that's really great!
To me, user behaviour data collection is a way to support and assist the above two. But it is not a requirement when trying to determine “if our software works for them”.
Users send us emails about issues, come to IRC to discuss them, file tickets, or just tell someone. All without having their usage data collected.
I am convinced “IE10 users that do not want to unset DNT” are no exception to that.
Have fun, Christian
P.S.: I for one received bug reports from IE10 users. (But I do not know whether or not they used DNT.)
Christian,
It seems that people are well enough informed by the field studies that our team runs to want us to continue to run them. In fact, demand has sky-rocketed both within and outside of the Wikimedia Foundation. You hold a minority opinion that testing software in the field is unnecessary. Yet, field tests are considered a best practice and have become a critical part of our strategy for minimizing the disruption (and maximizing the benefits) of software changes. I don't think that many people would appreciate your proposed strategy of releasing the software and waiting for people to complain. Given how difficult it is to develop good user-facing software, it's likely that every major deployment would be disruptive if we adopted that strategy. I can speak to a few disruptions that my research helped prevent and some opportunities that it helped us explore.
Allow me to share a specific example. In this study[1], we found that telling anonymous editors to register dropped their productivity by *25%.* Yet we didn't identify substantial issues in user testing. If we had not run this field experiment, we might have deployed the change thinking that we were improving Wikipedia when we were really driving good editors away. During the experiment, we received no substantial negative feedback.
For a large collection of field experiments that were used to iterate on Wikimedia software, see: https://meta.wikimedia.org/wiki/Growth
Really, what I want to say is this: If you want to improve privacy protections, I am your ally. We're merely disagreeing about whether it is good to assume that DNT means something it wasn't intended to mean or not. However, when you say that my work has no value, it's hard to talk to you productively because, honestly, I don't think your opinion is well-informed.
[1] https://meta.wikimedia.org/wiki/Research:Asking_anonymous_editors_to_registe...
-Aaron
Here's what we all agree on: We want the users of Wikimedia sites to have more control over whether their data is used for application improvement purposes. To be clear, we're not talking about data collected and deleted for operational purposes.
Based on our conversations, we have three choices.
1) We use the divide in interpreting what DNT means to interpret it in a more restrictive way. This has its own advantages and disadvantages as discussed in this list and others.
2) We add an opt-out option that users can use to signal that they don't want any data from them to be collected by us (except for operational purposes). The nice thing about this option is that Wikimedia has control over it and if browser X decides to change their DNT defaults (IE10 example Aaron brought up), we can stay consistent in the choices we provide to users. The downside is that I know it will take some time to implement this and we don't have an interim solution.
3) We use DNT as an interim solution and interpret DNT as "do not log anything from me" and work towards an opt-out option.
If we have capacity to go with option (2) and have it ready in a few months, I'd like us to go with that option. Otherwise, option (3) is a reasonable option to me.
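The three choices above can be read as a small decision rule. What follows is only a minimal sketch: the `Policy` labels, the `may_collect` helper, and the `opted_out` flag are hypothetical names introduced for illustration, and treating choice 3 as "either signal stops collection" is an assumption about how the interim period might work. Data kept for operational purposes is out of scope here, as noted above.

```python
from enum import Enum
from typing import Optional


class Policy(Enum):
    """Hypothetical labels for the three choices listed above."""
    DNT_ONLY = 1          # choice 1: read DNT restrictively
    OPT_OUT_ONLY = 2      # choice 2: Wikimedia-controlled opt-out
    DNT_THEN_OPT_OUT = 3  # choice 3: DNT as interim signal, plus opt-out


def may_collect(policy: Policy, dnt: Optional[str], opted_out: bool) -> bool:
    """Return True if non-operational usage data may be collected.

    dnt is the raw value of the request's DNT header ("1" or "0"),
    or None when the browser sent no such header.
    """
    dnt_set = dnt is not None and dnt.strip() == "1"
    if policy is Policy.DNT_ONLY:
        return not dnt_set
    if policy is Policy.OPT_OUT_ONLY:
        return not opted_out
    # Assumed interim behaviour: either signal is enough to stop collection.
    return not (dnt_set or opted_out)
```

Under choice 2 a `DNT: 1` header alone changes nothing, which is exactly the trade-off being debated in this thread.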
Leila
2) We add an opt-out option that users can use to signal that they don't want any data from them to be collected by us (except for operational purposes). The nice thing about this option is that Wikimedia has control over it and if browser X decides to change their DNT defaults (IE10 example Aaron brought up), we can stay consistent in the choices we provide to users. The downside is that I know it will take some time to implement this and we don't have an interim solution.
Note that any opt-out solution implemented needs a level of persistence, and that will also be subject to browser support. If we persist the opt-out in local storage, for example, that is by no means supported by all browsers. If we use a cookie, it will expire or be deleted, and we will need to ask the user again. Both these options include plenty of UX/UI work that I very much doubt will take place in the near future (this quarter).
As I have stated before, I will go for solution 1), given that we can start implementing it right now and it is very intuitive to explain to our users. We can treat browser oddities and lack of support as we do in other areas; we can, for example, not honor Do Not Track for IE10 (or the opposite). These are trade-offs that we are used to making when it comes to browser support.
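To make the persistence point concrete, here is a minimal server-side sketch of the cookie variant, using only Python's standard library. The cookie name `analytics_opt_out` and the one-year lifetime are assumptions made for illustration; nothing in this thread specifies them, and this is not a description of any existing Wikimedia code.

```python
from http.cookies import SimpleCookie

# Hypothetical cookie name and lifetime; neither comes from the thread.
OPT_OUT_COOKIE = "analytics_opt_out"
OPT_OUT_MAX_AGE = 365 * 24 * 60 * 60  # one year, in seconds


def has_opted_out(cookie_header: str) -> bool:
    """Return True if the request's Cookie header carries the opt-out flag."""
    jar = SimpleCookie()
    jar.load(cookie_header or "")
    morsel = jar.get(OPT_OUT_COOKIE)
    return morsel is not None and morsel.value == "1"


def opt_out_set_cookie() -> str:
    """Build the Set-Cookie header value that records an opt-out."""
    jar = SimpleCookie()
    jar[OPT_OUT_COOKIE] = "1"
    jar[OPT_OUT_COOKIE]["max-age"] = OPT_OUT_MAX_AGE
    jar[OPT_OUT_COOKIE]["path"] = "/"
    return jar[OPT_OUT_COOKIE].OutputString()
```

The sketch also shows the weakness described above: once the cookie expires or the user clears it, `has_opted_out` returns False again and the user would have to be asked a second time.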
I like how this discussion is progressing towards a solution.
I can't actually think of any practical opt out schemes _except_ for DNT at this time. Any thoughts?
-Toby
Hi Aaron,
On Thu, Jan 15, 2015 at 09:23:13AM -0600, Aaron Halfaker wrote:
> You hold a minority opinion that testing software in the field is unnecessary.
Hey, that's not what I've said :-)
And this misinterpretation of my previous email pretty much makes the rest of your argument moot from my point of view.
> We're merely disagreeing about whether it is good to assume that DNT means [...]
And I respect that we have different opinions about DNT. No doubt there.
> However, when you say that my work has no value, [...]
??? Again ... I think you're misreading my email. I never said that your work has no value.
Have fun, Christian
Christian, I appreciate your response, but if you only say how I misunderstood you without suggesting how I might have understood you better, I don't see a way to continue the conversation.
Ah. I think I see the confusion. When I referred to knowing whether the software "works" for a group of users or not, I'm talking about something more than technical requirements. Even software that is technically functioning can fail to serve its intended purpose. The work we do with field studies surfaces this. That's the point I was trying to make with the anon example.
What I find concerning is the idea that a biased subset of our users would be categorically ignored for this type of evaluation. If you agree with me that such evaluation is valuable to our users, I think you ought to also find such categorical exclusions concerning.
There may be room for a little nuance here. We can try to interpret DNT as a strict "do not collect anything" flag by default, and say from the start that it may be necessary to ignore it in some rare cases when we need access to data from IE10 users or similar. This creates a little extra work for us to communicate about those exceptional cases, but it may allow us to move forward here. I think I can safely assume we will recover any time lost in future communications by wrapping up this discussion sooner rather than later :)
To Leila's point, I think this solution would not cover apps, so we might still want to move towards an all-encompassing opt-out feature. From the user's point of view, "opt-out" could look consistent in web and apps. When the user clicks on the web version we could explain how DNT works in their browser and the special strict interpretation we use. On apps, we could have our own implementation similar to DNT. In both cases, I think we'd have a great opportunity to link to some landing page where people can find our research. Win-win :)
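A rough sketch of how that decision could look on the server side, assuming the browser signals Do Not Track with a `DNT: 1` request header and that IE10 identifies itself with `MSIE 10.0` in its User-Agent string. The function name, the `app_opt_out` flag for apps, and the `ignore_default_dnt` switch for the rare exceptional cases are all hypothetical names introduced here for illustration.

```python
import re
from typing import Mapping

# Assumption: IE10 announces itself as "MSIE 10.0" in its User-Agent string.
DNT_ON_BY_DEFAULT_UA = re.compile(r"MSIE 10\.0")


def may_collect(headers: Mapping[str, str],
                app_opt_out: bool = False,
                ignore_default_dnt: bool = False) -> bool:
    """Decide whether non-operational data may be collected for a request.

    Strict reading: DNT: 1 means "do not collect anything". The only
    exception is the explicitly enabled case of browsers that are assumed
    to send DNT without the user having chosen it.
    """
    # HTTP header names are case-insensitive, so normalise them first.
    lowered = {name.lower(): value for name, value in headers.items()}

    # Apps have no DNT header; they would pass their own opt-out flag.
    if app_opt_out:
        return False

    if lowered.get("dnt", "").strip() != "1":
        return True

    user_agent = lowered.get("user-agent", "")
    return ignore_default_dnt and bool(DNT_ON_BY_DEFAULT_UA.search(user_agent))
```

With the defaults, every request carrying `DNT: 1` is excluded, IE10 included. Passing `ignore_default_dnt=True` models the exceptional case, and it only affects user agents matching the IE10 pattern.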
> What I find concerning is the idea that a biased subset of our users would be categorically ignored for this type of evaluation. If you agree with me that such evaluation is valuable to our users, I think you ought to also find such categorical exclusions concerning.
Dan has mentioned a possible workaround, which would be the obvious long-standing practice of "do not pay attention to what IE10 is saying", but Aaron, please note that client-side EL, as it is right now, excludes ALL browsers with faulty or no JavaScript support, plus everyone with JavaScript turned off. This is certainly a "categorical exclusion", and one we have known to be present in our dataset from the very beginning. I can run the numbers, but I wouldn't be surprised if this percentage of users is higher than the percentage of all IE10 users put together. However, I do not think this is a problem: we know the data comes with these caveats, and we know, for example, that EL might not be that useful when trying to see in detail the behavior of users of Opera Mini, say (made-up example).
(Sorry, sent it too soon; re-sending.)
> What I find concerning is the idea that a biased subset of our users would be categorically ignored for this type of evaluation. If you agree with me that such evaluation is valuable to our users, I think you ought to also find such categorical exclusions concerning.
Dan has mentioned a possible workaround, which would be the obvious long-standing practice of "do not pay attention to what IE10 is saying", but Aaron, please note that client-side EL, as it is right now, excludes ALL browsers with faulty or no JavaScript support, plus everyone with JavaScript turned off. This is certainly a "categorical exclusion", and one we have known to be present in our dataset from the very beginning. I can run the numbers, but I wouldn't be surprised if the percentage of users with no JavaScript support is higher than the percentage of all IE10 users put together. More so with increasing mobile traffic. Now, it's a small number for sure, so, as Leila mentions, we can find the answer to many questions without actually polling every single user.
However, I do not think these caveats are a problem, as they are known. We know, for example, that EL might not be that useful when trying to see in detail the behavior of users of, say, "Opera Mini" (made-up example).
On Thu, Jan 15, 2015 at 9:55 PM, Aaron Halfaker ahalfaker@wikimedia.org wrote:
> What I find concerning is the idea that a biased subset of our users would be categorically ignored for this type of evaluation. If you agree with me that such evaluation is valuable to our users, I think you ought to also find such categorical exclusions concerning.
(In the e-mail below I sometimes use "we" to mean "Wikimedians" and sometimes to mean "Wikimedia Foundation employees". I am aware that this is a public discussion and that not all participants are employees of the Foundation. Hopefully the context will make my meaning clear.)
Aaron's point is valid. If we collect any data at all, we are morally obligated to do so in a way that can actually support rigorous research on questions of broad value to the community and humanity as a whole. Collecting data in a manner that we know cannot support serious research is morally obnoxious and it invalidates the mandate we claim to collect any data at all.
That said, I am not convinced that adopting a strong interpretation of DNT (and acting on it) would substantially compromise our ability to do research. The bias that it potentially introduces is of comparable magnitude to the risks of bias that scientists routinely accept in the interest of meeting ethical standards and respecting the rights of individuals. The fact that participation in drug trials is voluntary and that the compensation (when there is any) is usually fixed at a set amount is a good example.
I also think that our ability to conduct research would be compromised far more substantially were we to lose the confidence of our users. The only hope we have of gaining an understanding of Wikimedia is (in my opinion) through peer collaboration with our community. The question of whether we (Foundation employees) will be able to support a broad community of inquiry has much higher stakes than whether or not our data is fully representative of all user-agents.
The fact that there is no strong legal requirement forcing our hand here and that weaker interpretations of the header are defensible and plausible means that there is an opportunity here to lead by example and to send a strong message to our community and to the internet at large about our values and our commitment to our users. It's an opportunity I think we should take.
Ori,
I agree on all points. My assertions are this:
1. DNT means 3rd party tracking. It's in the definition.
2. However, we'd like to have a strict interpretation and act beyond the definition. This empowers our users and sets a good precedent.
3. The categorical exclusion of a substantial set of our users from field studies is concerning and can cause problems.
Though Nuria pointed out that DNT/IE10 is not the only potential categorical exclusion, that does not reduce the problem. If we can confirm that this won't cause a substantial issue, or implement a strategy to make sure it does not, then this won't be a problem.
-Aaron
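One way to check whether an exclusion causes "a substantial issue" is simple arithmetic. If a share p of users is excluded, the population mean of any metric is (1 - p) * mean_included + p * mean_excluded, so measuring only the included group is off by p * (mean_included - mean_excluded). The sketch below just encodes that identity; the function name and the figures in the usage lines are hypothetical and are not measurements from Wikimedia data.

```python
def exclusion_bias(excluded_share: float,
                   mean_included: float,
                   mean_excluded: float) -> float:
    """Bias of the included-only mean relative to the full population mean.

    population_mean = (1 - p) * mean_included + p * mean_excluded
    bias            = mean_included - population_mean
                    = p * (mean_included - mean_excluded)
    """
    if not 0.0 <= excluded_share <= 1.0:
        raise ValueError("excluded_share must be between 0 and 1")
    return excluded_share * (mean_included - mean_excluded)


# Purely illustrative figures: 5% of users excluded, and the metric is
# 0.30 for included users versus 0.20 for excluded users.
bias = exclusion_bias(0.05, 0.30, 0.20)
print(f"bias of the included-only estimate: {bias:.4f}")
```

The bias is small when either the excluded share is small or the two groups behave alike, which is the condition that would need to be confirmed for DNT users.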
I second Aaron’s concerns, which I previously expressed during the consultation about the new privacy policy. My main objection to the proposed solution is that by saying “Wikimedia honors DNT headers” we imply – by the most popular/de facto interpretation of DNT – that we do 3rd party tracking but we allow users to opt out, which puts WMF on par with aggressive tracking practices adopted by most sites.
I’d rather focus on a clean and transparent implementation of an opt-out mechanism that doesn’t create confusion and gives the user a clear understanding of what s/he is opting out of, instead of piggybacking on DNT.
I too am worried about the impact of excluding a segment of the user population from the (aggregate) measurements that we obtain via instrumentation and use to assess the impact of Product changes, but I’m ready to push the discussion of what is an acceptable tradeoff to our customers (the community and decision-makers at WMF). It’s also worth remembering that all data collected via EventLogging that contains PII such as IP addresses or raw UserAgents is subject to our data retention guidelines. [1]
Dario
[1] https://meta.wikimedia.org/wiki/Data_retention_guidelines
On Fri, Jan 16, 2015 at 4:25 PM, Dario Taraborelli < dtaraborelli@wikimedia.org> wrote:
I second Aaron’s concerns, which I previously expressed during the consultation about the new privacy policy. My main objection to the proposed solution is that by saying “Wikimedia honors DNT headers” we imply – by the most popular/de facto interpretation of DNT – that we *do* 3rd party tracking but we *allow* users to opt out, which puts WMF on par with aggressive tracking practices adopted by most sites.
But we wouldn't say that; that would be silly. We'd make it completely clear that we are making use of the header that we think is consistent with the expectation of users but which departs from the standard in a significant way.
I, too, agree that this is something we (Comms) can handle through proper communications. It's not a big concern for me.
Leila
+1 to Leila and Ori. I do not think this is an issue that cannot be solved with a simple FAQ.
Ori,
“we are making use of the header that we think is consistent with the expectation of users”
based on what evidence?
I’ve seen a single reference cited in this thread pointing to a study that candidly declares in its abstract:
“Because Do Not Track is so new, as far as we know this is the first scholarship on this topic. This paper has been neither presented nor published.” [1]
The ample and representative sample considered by the EFF is well captured at the beginning of this statement:
“Intuitively, users who we’ve talked to want Do Not Track to provide meaningful limits on collection and retention of data.”
Nobody is questioning the need to be transparent to our users about what data we’re collecting, how long this data is retained and what it’s being used for. But I see a thread full of handwaving statements about “what users really want”, in contrast to a pretty straightforward truth that nobody who participated in this thread would challenge:
“which departs from the standard in a significant way.”
I don’t see myself blessing a proposal that represents “a significant departure from the standard” and I’d love to see more substantial evidence on user expectations to justify this.
Dario
[1] http://papers.ssrn.com/sol3/papers.cfm?abstract_id=1993133
I didn’t reference the McDonald study in my reply, but I too am not particularly persuaded by the conclusions.
“Many think it means they will not be tracked at all, including collection”
suggests to me a fundamental lack of literacy among the users surveyed about what data browsers pass with HTTP requests.
I’m searching for references looking at user perception of third-party behavioral tracking vs. logging; any pointers would be appreciated.
suggests to me a fundamental lack of literacy among the users surveyed about what data browsers pass with HTTP requests.
Of course. Do you expect the average user to know what HTTP is or what a 'request' is? Likely not; it is a technical topic that you do not need to be familiar with to effectively use your smartphone/computer.
In more practical terms, looking at our team's goals for next quarter, we do not have any near-term plans to develop a specific opt-out UI/UX and standard in the next 3 months. I therefore think that proceeding with DNT as an intermediate solution, as outlined by Leila earlier in the thread, is acceptable. As with anything we do, we can iterate, change, and expand as needed in the future.
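As a rough illustration of the interim approach, here is a minimal Python sketch of dropping instrumentation events for requests that carry `DNT: 1`. The event layout and the names `has_dnt_opt_out` and `filter_events` are assumptions made up for this example; they are not taken from EventLogging or any existing WMF code.

```python
def has_dnt_opt_out(headers):
    """Return True when the request headers contain `DNT: 1`."""
    # HTTP header names are case-insensitive, so normalise them before lookup.
    normalised = {name.lower(): value.strip() for name, value in headers.items()}
    return normalised.get("dnt") == "1"


def filter_events(events):
    """Keep only events whose request did not opt out via DNT.

    Each event is a hypothetical dict with a "headers" mapping.
    """
    return [event for event in events if not has_dnt_opt_out(event["headers"])]


# Hypothetical sample events, for illustration only.
events = [
    {"page": "Influenza", "headers": {"DNT": "1"}},
    {"page": "Influenza", "headers": {"User-Agent": "ExampleBrowser"}},
    {"page": "Fever", "headers": {"dnt": "0"}},
]
kept = filter_events(events)
```

Under this strict reading, any request with `DNT: 1` is excluded regardless of whether the browser set it by default, which is exactly the categorical exclusion discussed earlier in the thread.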
Hi Aaron,
On Thu, Jan 15, 2015 at 06:38:40PM -0600, Aaron Halfaker wrote:
[...] if you only say how I misunderstood you without suggestion how I might have understood you better, [...]
This was on purpose. The thread (especially the non-public part) got too emotional/heated, and people complained that they're getting too much email.
Elaborating on my reasoning and where our points of view differ would only add fuel to the fire. This is not necessary.
However, I felt a strong need to at least clarify that I think you misread my email. Otherwise you'd think that I did not like the work you're doing. That would be unjust.
Have fun, Christian
Christian Aistleitner, 15/01/2015 14:22:
If WMF's main form of QA was through automated usage data collection, you'd have a point.
Aaron Halfaker, 15/01/2015 16:23:
You hold a minority opinion that testing software in the field is unnecessary. [...]
Allow me to share a specific example. In this study[1], we found that telling anonymous editors to register dropped their productivity by *25%.*
IMHO you two are actually saying the same thing on this specific example, and I agree with both. Not collecting data about IE10 users would not increase the number of *bugs* they experience. Research about our users gives important results: a finding like this would not be impeded by ignoring IE10.
At a more general level, the conclusion is harder.
Nemo
On Wed, Jan 14, 2015 at 10:07 AM, Aaron Halfaker ahalfaker@wikimedia.org wrote:
For example, not collecting usage data about certain sections of our population (e.g. IE10 users where DNT is set by default) means that we don't know if our software works for them.
Note that IE 10 does inform you about the option and that it provides a means for toggling it: https://skitch-img.s3.amazonaws.com/20120907-tu8m3msxk7sngf61188f8hxm7a.png
This is subtle, but still significant. It was cited by many participants in the debate about whether or not to honor IE10's DNT in Apache:
https://github.com/apache/httpd/commit/a381ff35fa4d50a5f7b9f64300dfd98859dee... https://issues.apache.org/bugzilla/show_bug.cgi?id=53845
Cf.
* https://meta.wikimedia.org/wiki/Talk:Privacy_policy/Archives/2013#Do_Not_Tra...
* https://meta.wikimedia.org/wiki/Talk:Privacy_policy/Archives/2014#Changing_D...
* https://meta.wikimedia.org/wiki/Talk:Requests_for_comment/User_site_behavior...
(This thread, when it's over, should probably be summarised in that RfC.)
Nemo
Fair enough - I don't use it, and I think I'd got entirely the wrong end of the stick on what it's for! If it's intended to stop tracking by third-party sites then it certainly seems to be of little relevance here.
(It might be worth clarifying this in the proposal, in case a future ethics-committee reviewer gets the same misapprehension?)
Andrew.
On 13 January 2015 at 20:24, Aaron Halfaker ahalfaker@wikimedia.org wrote:
Andrew,
I think it is reasonable to assume that the "Do not track" header isn't referring to this.
From http://donottrack.us/ with emphasis added.
Do Not Track is a technology and policy proposal that enables users to opt out of tracking by websites they do not visit, [...]
Do not track is explicitly for third party tracking. We are merely proposing to count those people who do access our sites. Note that, in this case, we are not interested in obtaining identifiers at all, so the word "track" seems to not apply.
It seems like we're looking for something like a "Do Not Log Anything At All" header. I don't believe that such a thing exists -- but if it did I think it would be good if we supported it.
-Aaron
Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
On Wed, Jan 14, 2015 at 9:22 AM, Andrew Gray andrew.gray@dunelm.org.uk wrote:
Fair enough - I don't use it, and I think I'd got entirely the wrong end of the stick on what it's for! If it's intended to stop tracking by third-party sites then it certainly seems to be of little relevance here.
I think you're right to be concerned about this. It is about expectations; people do not expect an NGO providing an encyclopedia to be silently capturing reading behaviour data.
If the data is provided to other entities, even for noble research objectives, people expect "Do Not Track" to cover this.
https://cyberlaw.stanford.edu/node/6573
I'm confused; John, could you point to the element of the collected data that isn't collected already by default in any Nginx or Apache setup? I agree that there might be a lack of user expectation, but 'silently capturing behavioral data' seems somewhat hyperbolic to describe what's actually going on.
On Wed, Jan 14, 2015 at 2:25 PM, Oliver Keyes ironholds@gmail.com wrote:
I'm confused; john, could you point to the element of the collected data that isn't collected already by default in any Nginx or Apache setup? I agree that there might be a lack of user expectation, but 'silently capturing behavioral data' seems somewhat hyperbolic to describe what's actually going on.
The proposed element to be added is geolocation below country level. Default Nginx and Apache log formats do not include geolocation. Which is why this research proposal exists and is being discussed, and rightly so.
fwiw, the Nginx geoip module is not even included, by default, when compiling the source code.
As the paper explicitly describes, and is a common theme in research proposals, Wikimedia access log information is user reading behaviour being captured.
The old privacy and data retention policies gave users the expectation that access log data was destroyed after a set period, assumed to be only three months as that was the limit of Checkuser visibility. The current policies are more like "yes we collect a lot of data about users, using tracking technology, and please trust us." And "sorry we don't honour 'Don't track us', as we presumed that you trust us and the researchers that we allow to access our analytics."
We should be planning for what will be the effect when the WMF servers are hacked and _all_ of the analytics data is now in the hands of a repressive government or similar. Or, imagine the WMF sends the analytics data across an insecure link which is tapped and the data reconstructed, either due to not using secure links at all, or an accidental routing problem. https://lists.wikimedia.org/pipermail/wikimedia-l/2013-December/129357.html
If/When that day comes, hopefully they don't have much data to make inferences from, and what data they obtain can be well justified.
Having a quick peek, I thought it was odd that browsing Wikimedia sites now causes impressions to be sent back to the WMF servers with the country of the user included. "This is a workaround to simplify analytics." https://git.wikimedia.org/blob/mediawiki%2Fextensions%2FCentralNotice/8ee877...
The more you collect, especially using multiple systems to collect similar data, the more likely that if subpoenaed, WMF's various datasets could be used to infer a pretty reliable answer to "which days in 2013 was John Vandenberg in Indonesia?", or "when did John Vandenberg first read the Wikipedia article about <bomb making ingredient>?" The more you publish, even aggregated, the more likely these types of questions can be inferred without a subpoena, at least for users with large enough lists of public contributions, by scientists like yourself with lots of computation power and plenty of time on their hands rifling through the data to *infer* the identity of editors, and if it is a government body they also have lots of other datasets which can be used to assist in the task.
Adding fine-grained geolocation information to published page views is an example of the latter and the paper wisely suggests not including logged in users as a possible solution to some of the privacy issues.
There is also the problem that many IPs can be easily inferred to be a single cohort of people in some situations, e.g. in regions where the only large collection of computers is a single facility such as a school. In a repressive regime especially, that could lead to official questions being asked, like: why were so many students at this school reading about <blah> on <date>? And to teachers being identified as responsible, etc.
The paper considers IP users vs logged in users to be a binary set. However there are tools built which exploit the fact that logged in users make a logged out edit which identifies their IP. Add geolocation of pageviews and we can infer the probability that other IPs in their smallest geolocation block are also likely to be edits by the same person, as the algorithm in the paper leaks 'number of active editors in each region each day'.
The purpose of this proposed change in analytics is summarised in the paper:
'In short, the current global aggregation of Wikipedia page view is unsuitable for an operational disease monitoring system. There will be no “Wikipedia Flu Trends” unless page view data are aggregated at a finer geographic scale.'
If "Wikipedia Flu Trends" is the justification, we had better be certain that detecting Flu Trends using Wikipedia is going to be the most effective method, and isn't just an academically interesting exercise. A limited trial to determine utility would be helpful to establish if "Wikipedia Flu Trends" is a viable world health solution worthy of justifying additional data retention and publishing of aggregates.
Is there a minimum threshold at which views of a page mean it becomes 'interesting' to analyse using finer-grained geographic data? I suspect that pages with only hundreds of page views per day are not particularly useful for "Wikipedia Flu Trends".
Also, does "Wikipedia Flu Trends" need to have access to geographically tagged page view data of, say, me reading http://en.wiktionary.org/wiki/bota today? Is there a way to restrict which types of pages are tracked at finer geographic granularity without adversely affecting the "Wikipedia Flu Trends" graph?
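To make the two restrictions asked about here concrete, the following Python sketch selects pages for finer-grained geographic aggregation only when they are on a topic allowlist and exceed a daily view threshold. The threshold value, the allowlist contents, and the function name `pages_for_fine_geo` are all hypothetical; none of them come from the proposal.

```python
MIN_DAILY_VIEWS = 1000  # hypothetical cut-off, not a value from the proposal
HEALTH_PAGES = {"Influenza", "Fever"}  # hypothetical allowlist of relevant pages


def pages_for_fine_geo(daily_views, allowlist=HEALTH_PAGES, min_views=MIN_DAILY_VIEWS):
    """Return the pages eligible for sub-country aggregation.

    daily_views maps a page title to its total views for one day.
    A page qualifies only if it is allowlisted AND busy enough.
    """
    return {
        page
        for page, views in daily_views.items()
        if page in allowlist and views >= min_views
    }


# Made-up numbers, for illustration only.
daily_views = {"Influenza": 25000, "Fever": 300, "bota": 40000}
selected = pages_for_fine_geo(daily_views)
```

With these made-up numbers only "Influenza" qualifies: "Fever" is allowlisted but too quiet, and "bota" is busy but not on the allowlist.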
-- John Vandenberg
On Wed, Jan 14, 2015 at 3:39 AM, John Mark Vandenberg jayvdb@gmail.com wrote:
On Wed, Jan 14, 2015 at 2:25 PM, Oliver Keyes ironholds@gmail.com wrote:
I'm confused; john, could you point to the element of the collected data that isn't collected already by default in any Nginx or Apache setup? I agree that there might be a lack of user expectation, but 'silently capturing behavioral data' seems somewhat hyperbolic to describe what's actually going on.
The proposed element to be added is geolocation below country level. Default Nginx and Apache log formats do not include geolocation. Which is why this research proposal exists and is being discussed, and rightly so.
Gotcha: I thought you were referring to the information we already have.
fwiw, the Nginx geoip module is not even included, by default, when compiling the source code.
As the paper explicitly describes, and is a common theme in research proposals, Wikimedia access log information is user reading behaviour being captured.
The old privacy and data retention policies gave users the expectation that access log data was destroyed after a set period, assumed to be only three months as that was the limit of Checkuser visibility. The current policies are more like "yes we collect a lot of data about users, using tracking technology, and please trust us." And "sorry we don't honour 'Don't track us', as we presumed that you trust us and the researchers that we allow to access our analytics."
We should be planning for what will be the effect when the WMF servers are hacked and _all_ of the analytics data is now in the hands of a repressive government or similar. Or, imagine the WMF sends the analytics data across an insecure link which is tapped and the data reconstructed, either due to not using secure links at all, or an accidental routing problem. https://lists.wikimedia.org/pipermail/wikimedia-l/2013-December/129357.html
The geolocation proposal is to perform it over IP addresses...which are already stored. So, the only major difference between "hacking" now and "hacking" later is that doing it later means you don't have to spend 99 bucks on a geolocation hashtable.
If/When that day comes, hopefully they don't have much data to make inferences from, and what data they obtain can be well justified.
Having a quick peek, I thought it was odd that browsing Wikimedia sites now causes impressions to be sent back to the WMF servers with the country of the user included. "This is a workaround to simplify analytics." https://git.wikimedia.org/blob/mediawiki%2Fextensions%2FCentralNotice/8ee877...
CentralNotice and the fundraising banners have done this for absolutely years, yes; that's the code you're looking at.
The more you collect, especially using multiple systems to collect similar data, the more likely that if subpoenaed, WMF's various datasets could be used to infer a pretty reliable answer to "which days in 2013 was John Vandenberg in Indonesia?", or "when did John Vandenberg first read the Wikipedia article about <bomb making ingredient>?" The more you publish, even aggregated, the more likely these types of questions can be inferred without a subpoena, at least for users with large enough lists of public contributions, by scientists like yourself with lots of computation power and plenty of time on their hands rifling through the data to *infer* the identity of editors, and if it is a government body they also have lots of other datasets which can be used to assist in the task.
Yep, and that's why we're discussing this.
Adding fine-grained geolocation information to published page views is an example of the latter and the paper wisely suggests not including logged in users as a possible solution to some of the privacy issues.
There is also the problem that many IPs can be easily inferred to be a single cohort of people in some situations, e.g. in regions where the only large collection of computers is a single facility such as a school. In a repressive regime especially, that could lead to official questions being asked like: why were so many students at this school reading about <blah> on <date>. And teachers being identified as responsible, etc.
The paper considers IP users vs logged in users to be a binary set. However there are tools built which exploit the fact that logged in users make a logged out edit which identifies their IP. Add geolocation of pageviews and we can infer the probability that other IPs in their smallest geolocation block are also likely to be edits by the same person, as the algorithm in the paper leaks 'number of active editors in each region each day'.
No, it doesn't: the proposal is to aggregate. Where there are few observations (or little variation in observations) within a geographic region, the data will be moved up one level and aggregated, and so on until a sufficient degree of fuzziness is reached. This is the very basis of k- and i-anonymity.
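The roll-up described above can be sketched roughly as follows. This is a minimal illustration, not the proposal's actual algorithm: the function name `roll_up`, the representation of a location as a tuple path (country, region, city), the use of raw view counts as the thing compared against the threshold, and the place labels are all assumptions made for the example.

```python
def roll_up(counts, k):
    """Publish each cell at the finest level where it has at least k observations.

    counts: dict mapping a location path, e.g. ("US", "NM", "Los Alamos"),
            to a number of observations (hypothetical input format).
    k:      minimum cell size (assumed threshold).
    Cells below k are merged into their parent, one level at a time.
    """
    pending = dict(counts)
    published = {}
    max_depth = max((len(path) for path in pending), default=0)

    # Work from the finest level upwards.
    for depth in range(max_depth, 0, -1):
        for path in [p for p in pending if len(p) == depth]:
            n = pending.pop(path)
            if n >= k:
                published[path] = n          # large enough: publish as-is
            else:
                parent = path[:-1]           # too small: move up one level
                pending[parent] = pending.get(parent, 0) + n

    # Whatever is left has reached the root () and is reported globally,
    # without any location attached.
    published.update(pending)
    return published


example = {
    ("US", "NM", "Los Alamos"): 3,
    ("US", "NM", "Santa Fe"): 4,
    ("US", "CA", "San Francisco"): 40,
    ("ID", "Bali", "Ubud"): 2,
}
print(roll_up(example, k=5))
```

In this toy input, the two small New Mexico cells merge into one state-level cell, while the single small cell with no large-enough siblings ends up in the global bucket. A real implementation would also have to consider whether published parent and child totals can be differenced to recover a small cell.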
The purpose of this proposed change in analytics is summarised in the paper:
'In short, the current global aggregation of Wikipedia page view is unsuitable for an operational disease monitoring system. There will be no “Wikipedia Flu Trends” unless page view data are aggregated at a finer geographic scale.'
If "Wikipedia Flu Trends" is the justification, we had better be certain that detecting Flu Trends using Wikipedia is going to be the most effective method, and isn't just an academically interesting exercise. A limited trial to determine utility would be helpful to establish if "Wikipedia Flu Trends" is a viable world health solution worthy of justifying additional data retention and publishing of aggregates.
Is there a minimum threshold at which views of a page mean it becomes 'interesting' to analyse using finer grained geographic data? I suspect that pages with only hundreds of page views per day are not particularly useful for "Wikipedia Flu Trends".
Also does "Wikipedia Flu Trends" need to have access to geographically tagged page view data of, say, me reading http://en.wiktionary.org/wiki/bota today? Is there a way to restrict which types of pages are tracked at finer geographic granularity without adversely affecting the "Wikipedia Flu Trends" graph?
-- John Vandenberg
Hi folks,
Reviving an old thread (my apologies for the delay). I’ve looked over this thread, the talk page linked below, and a few other places that seemed like they might have feedback for us.
It seemed to me that key feedback, in addition to some technical suggestions, was:
* Ratio of logged in to logged out readers can be inferred.
* Think more carefully about whether reading patterns can be inferred for anonymous editors.
* How to interpret the Do-Not-Track header is controversial.
As for DNT, my main concern from the research perspective is whether interpreting DNT as exclusion from geo-aggregation would reduce the sample size excessively. Luis Villa’s link for Firefox numbers shows a peak of 11% in March 2013, declining to 8% at the end of the data in September 2014, for the desktop version, and a 17% peak in July 2012 with a similar decline to 5% in September 2014 for mobile users. With these types of numbers, I believe the larger sample (i.e., DNT hits included in geo-aggregation) will indeed support somewhat more robust results, but the smaller sample (exclude DNT) is fine. I worry some about growth, but as long as it’s not the default, that’s probably not a major concern.
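As a rough back-of-envelope check on the sample-size point, the sketch below estimates how much the standard error of a simple count-based estimate would grow if a given share of hits were dropped. It rests on strong assumptions that are not in the proposal: that DNT hits are a random subset of all hits and that error scales as 1/sqrt(n). It says nothing about skew if DNT users differ systematically from other readers. The function name `se_inflation` is hypothetical.

```python
import math


def se_inflation(dnt_share):
    """Relative growth in standard error when a random fraction of hits is excluded.

    Assumes error scales as 1/sqrt(n), so keeping (1 - dnt_share) of the
    sample multiplies the standard error by 1/sqrt(1 - dnt_share).
    """
    return 1.0 / math.sqrt(1.0 - dnt_share) - 1.0


# DNT shares quoted in the thread for Firefox desktop and mobile.
for share in (0.05, 0.08, 0.11, 0.17):
    print(f"{share:.0%} of hits excluded -> standard error up by {se_inflation(share):.1%}")
```

Under these assumptions the loss of precision stays in the single-digit percent range for the shares quoted above, which is consistent with the view that the smaller sample is still usable.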
One thing that I would really like feedback on is: what is an acceptable k — i.e., how large is the set of users from whom a specific user is indistinguishable? I believe this will have a significantly greater impact on the quality of our results than DNT.
Please let me know if I’ve missed anything. I’d like to rev the proposal soon, and I’d like to make it responsive to what the community thinks.
Thanks, Reid
[Just to be absolutely clear, I’m speaking for myself, not my employer.]