Forking since I think there are two conversations - one about the format of UA for the mobile apps and one about CheckUser requirements for anything that does edits. Having them separate would be useful.
For those who do not know what CheckUser means, I recommend reading https://en.wikipedia.org/wiki/Wikipedia:CheckUser.
IP address and UA are two of the most important pieces of info CUs have in helping prevent abuse. IP is already sort of useless with mobile networks - a lot of providers do NAT and similar things, which means that we cannot come remotely close to reasonably assuming 1 IP = 1 User, or anything similar to that. UA provides more fingerprinting ability, but CU isn't the only thing that consumes UA - other parts of the infrastructure do as well.
So what we need is a way to preserve the ability to fingerprint only users making edits (no read actions!) for CU. I am sure that can be implemented without having a very fingerprintable UA, with simple hooks on both the app's side and on Extension:CheckUser.
We could generate a simple fingerprint that's unique per device (and completely disconnected from every other device identifier) that we send only with edits (and other 'POST' actions) as a separate header. This can be processed by CU (perhaps with a hook that Extension:MobileApp can hook into) and then used by CheckUsers. This data will be treated with the same data retention / privacy policy that applies to CUs now, and regular UA data can be consumed by other consumers without too many fingerprinting concerns.
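As a rough illustration of the app-side half of this idea, here is a minimal Python sketch. It assumes the fingerprint is a random UUID generated once per install and that it travels in a header called `X-Device-Fingerprint`; both the header name and the function names are hypothetical and were not settled in this thread.

```python
import uuid

# Hypothetical header name; the thread does not specify one.
FINGERPRINT_HEADER = "X-Device-Fingerprint"


def new_device_fingerprint() -> str:
    """Return a random per-install ID.

    uuid4 is random, so the value is not derived from any hardware,
    account, or other device identifier.
    """
    return uuid.uuid4().hex


def build_headers(method: str, user_agent: str, fingerprint: str) -> dict:
    """Build request headers, attaching the fingerprint only to POSTs."""
    headers = {"User-Agent": user_agent}
    if method.upper() == "POST":
        # Edits and other write actions carry the extra header;
        # read (GET) requests never do.
        headers[FINGERPRINT_HEADER] = fingerprint
    return headers
```

In a sketch like this the app would generate the fingerprint once, store it locally, and pass it to `build_headers` on every request; the server-side hook into Extension:CheckUser is not shown.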
I talked to hoo and he said the CU hook shouldn't be too much of a problem, and the app side of the issue is rather simple too. Deskana (speaking solely as a volunteer CU) says that this solution is acceptable to him. Thoughts, other people?
On Thu, Mar 27, 2014 at 10:43 PM, Oliver Keyes okeyes@wikimedia.org wrote:
Repost, because filtering; there might be a point of confusion here that's causing the problem. As I understand it, the user agent sanitisation is expected to apply to EventLogging data, and data in the Analytics pipeline, but not data streaming into MediaWiki proper - namely, the cu_changes table. Nuria, is that the case?
On 27 March 2014 08:16, Nuria Ruiz nuria@wikimedia.org wrote:
> Rather than having an ethical debate over it, we could always test the actual usefulness with Science. That way we'd be able to see how much granularity each additional component adds to the data.
I kind of feel we are going backwards, as we thoroughly discussed this point. Technical info and references regarding entropy, user agents and fingerprinting can be found here: https://www.mediawiki.org/wiki/EventLogging/UserAgentSanitization
On Thu, Mar 27, 2014 at 3:49 PM, Oliver Keyes okeyes@wikimedia.org wrote:
+1. I'm totally down for keeping less information around, but if it gets in the way of people doing their job?
Rather than having an ethical debate over it, we could always test the actual usefulness with Science. That way we'd be able to see how much granularity each additional component adds to the data.
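One way to make "how much granularity each additional component adds" concrete is to compute the Shannon entropy of the observed user agents while keeping only the first few components. The Python sketch below is only an illustration under that assumption; the function names are hypothetical, and the records are expected to be tuples of UA components such as (app version, OS, form factor, OS version).

```python
import math
from collections import Counter


def shannon_entropy_bits(values):
    """Shannon entropy, in bits, of the empirical distribution of values."""
    counts = Counter(values)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def entropy_with_fields(records, n_fields):
    """Entropy when only the first n_fields components of each record are kept."""
    return shannon_entropy_bits(tuple(r[:n_fields]) for r in records)


# The two example user agents from this thread, split into components.
records = [
    ("32", "Android", "Phone", "4.4"),
    ("2.0", "iOS", "Phone", "7.1"),
]
for n in range(1, 5):
    print(n, entropy_with_fields(records, n))
```

Run over a real sample of requests, the difference between consecutive values would show how many extra bits of identifying information each added component contributes.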
On 27 March 2014 07:15, Aaron Halfaker ahalfaker@wikimedia.org wrote:
> Including more information on the UA, while being covered by legal under the new privacy policy, really goes against the wishes of the community as they do not wish to be fingerprinted.
I don't think that "the wishes of the community" have been established and the whole point of checkuser is that it allows for fingerprinting.
On Thu, Mar 27, 2014 at 4:20 AM, Nuria Ruiz nuria@wikimedia.org wrote:
> As a checkuser, user agents are an important part of my workflow for identifying that multiple accounts are owned by the same person. So I'm going to have to argue for including more information in the user agent.
Including more information on the UA, while being covered by legal under the new privacy policy, really goes against the wishes of the community as they do not wish to be fingerprinted. See: https://www.mediawiki.org/wiki/Talk:EventLogging/UserAgentSanitization or https://meta.wikimedia.org/wiki/Talk:Privacy_policy There have been plenty more discussions about this on the analytics e-mail list.
> Your proposed user agent would basically mean that every single person using the most up-to-date version of the app on a particular platform would be indistinguishable from each other. This would, unfortunately, lead to lots of innocent users getting blocked as sockpuppets.
However, note that the UA "WikipediaApp/<version> <OS>/<form-factor>/<version>" clearly satisfies the use case of the mobile team. It provides as much information as they need from their users without sending any private data.
Can you please describe your use case? Namely, how are you identifying "false" accounts? Perhaps relying on the user agent to do so is not the best strategy going forward. Keep in mind that with the old privacy policy UA data needed to be discarded after 90 days. With the new policy there is more legal room, but given community feedback the analytics team is planning on aggregating all UA information in the future. This means that UA data will not be stored (or reported) per user or request but rather aggregated (as in "4% of users use iPhone").
We recently gathered information from all teams about use cases pertaining to UA data collection:
https://office.wikimedia.org/wiki/Analytics/Internal/EventLogging/PrivateDat....
Let's talk about your use case and add it to the existing document that describes usages of user agent data. This document was sent out to all teams a couple of months ago, but there is no description of your use case there:
https://docs.google.com/a/wikimedia.org/document/d/1bp6qrvYi0Mh7l0s1psGnXEEN...
On Wed, Mar 26, 2014 at 11:20 PM, Dan Garry dgarry@wikimedia.org wrote:
Hey Yuvi,
As a checkuser, user agents are an important part of my workflow for identifying that multiple accounts are owned by the same person. So I'm going to have to argue for including more information in the user agent. Your proposed user agent would basically mean that every single person using the most up-to-date version of the app on a particular platform would be indistinguishable from each other. This would, unfortunately, lead to lots of innocent users getting blocked as sockpuppets.
Here's an example of a user agent from an iPhone using Safari: Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_3_1 like Mac OS X; zh-tw) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8G4 Safari/6533.18.5
Look at all of that wonderful information! ;-) In general, the more information you can include without breaching the user's privacy, the better.
I'd be happy to work with you on this.
Thanks, Dan
P.S. You may also want to consult with the legal team, to ensure that unacceptable levels of private information are not given out. They would also make a good complement to me; I would likely be pulling in the direction of "MOAR INFORMATION!", whereas they would likely be pulling in the direction of "LESS INFORMATION!". :-)
On 26 March 2014 15:00, Yuvi Panda yuvipanda@gmail.com wrote:
> Add Analytics to cc, as I think they'll be interested as well :)
>
> On Thu, Mar 27, 2014 at 3:20 AM, Yuvi Panda yuvipanda@gmail.com wrote:
> > Hello!
> >
> > We are getting closer to a general release of the Wikipedia Android and iOS apps, and I think we should standardize on a User-Agent format. The old app just appended an identifier in front of the phone's default UA[1] but I think we can do better, to avoid privacy concerns[2].
> >
> > How about:
> >
> > WikipediaApp/<version> <OS>/<form-factor>/<version>
> >
> > This gives us all the info we need (App version, OS, Form Factor (Tablet / Phone) and OS version) without giving away too much. It is also fairly simple to construct and parse.
> >
> > For the latest alpha, my Nexus 4 would generate
> >
> > WikipediaApp/32 Android/Phone/4.4
> >
> > While an iOS device might generate
> >
> > WikipediaApp/2.0 iOS/Phone/7.1
> >
> > form-factor would just be Phone|Tablet for now, and can be expanded later if necessary.
> >
> > Thoughts?
> >
> > [1]: https://www.mediawiki.org/wiki/Mobile/User_agents#Apps
> > [2]: https://www.mediawiki.org/wiki/EventLogging/UserAgentSanitization
> > --
> > Yuvi Panda T
> > http://yuvi.in/blog
>
> --
> Yuvi Panda T
> http://yuvi.in/blog
>
> _______________________________________________
> Analytics mailing list
> Analytics@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics
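To illustrate the claim that the proposed format is simple to construct and parse, here is a minimal Python sketch. The type and function names are hypothetical, and the regular expression assumes that form-factor is exactly Phone or Tablet and that no component contains whitespace, which the proposal implies but does not state.

```python
import re
from typing import NamedTuple, Optional


class AppUserAgent(NamedTuple):
    app_version: str
    os: str
    form_factor: str  # "Phone" or "Tablet" for now
    os_version: str


# WikipediaApp/<version> <OS>/<form-factor>/<version>
_UA_RE = re.compile(
    r"^WikipediaApp/(?P<app_version>\S+) "
    r"(?P<os>[^/\s]+)/(?P<form_factor>Phone|Tablet)/(?P<os_version>\S+)$"
)


def build_user_agent(ua: AppUserAgent) -> str:
    """Serialise the four components into the proposed UA string."""
    return f"WikipediaApp/{ua.app_version} {ua.os}/{ua.form_factor}/{ua.os_version}"


def parse_user_agent(value: str) -> Optional[AppUserAgent]:
    """Parse a UA string in the proposed format; return None if it does not match."""
    match = _UA_RE.match(value)
    return AppUserAgent(**match.groupdict()) if match else None
```

With this sketch, `parse_user_agent("WikipediaApp/32 Android/Phone/4.4")` yields the four components from the Nexus 4 example, while a browser UA such as the Safari string quoted earlier does not match and yields `None`.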
-- Dan Garry Associate Product Manager for Platform Wikimedia Foundation
Analytics mailing list Analytics@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/analytics
-- Oliver Keyes Research Analyst Wikimedia Foundation
Speaking in my capacity as a long-term volunteer checkuser (though not currently, because of my work requirements), a very active work-related owner/user of checkuser in the LCA team [probably the most active within staff], and a strong advocate for saving as little info as possible, I think your proposed adjustment makes sense.
That (assuming it's done for all logged actions as you suggest) seems like it would fit into the CU requirements well while saving as little information as needed on readers.
You say that the 2nd 'edit' user agent will be sent as a separate header. I imagine that would still be recorded in the read logs then; is it just that it wouldn't be saved long term after the logs are processed in some way to remove other headers? [That would make sense to me, but if it's going to be kept in the logs as long as the user agent in the first place, I don't know why we wouldn't just switch what was being sent 'as' the user agent.]
James
James Alexander Legal and Community Advocacy Wikimedia Foundation (415) 839-6885 x6716 @jamesofur
Mobile-l mailing list Mobile-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/mobile-l
Note: I speak in this thread as a volunteer checkuser, not as the product manager for platform. Not sending from my volunteer email address because I don't want to subscribe to this list on two separate email addresses. :-)
The originally proposed spec (user agent to include device type and app version) would have been very disruptive to the workflow of checkusers, which relies in part on user agent data. Your proposed update (including a client-specific identifier with the user agent for CU) is sensible, and lets checkusers do their job of dealing with people abusing our sites without unnecessarily divulging personally identifying information to the checkusers.
If any interested parties have improvements to propose, let's hear them!
Thanks, Dan
Hey folks -- we aren't considering changing any of the data that goes into checkuser. That tool will be unchanged.
This discussion only concerns backend logging: EventLogging and page view analytics.
thanks,
-Toby
On Thu, Mar 27, 2014 at 3:30 PM, Dan Garry dgarry@wikimedia.org wrote:
Note: I speak in this thread as a volunteer checkuser, not as the product manager for platform. Not sending from my volunteer email address because I don't want to subscribe to this list on two separate email addresses. :-)
The originally proposed spec (user agent to include device type and app version) would have been very disruptive to the workflow of checkusers, which relies in part on user agent data. Your proposed update (including a client-specific identifier with the user agent for CU) is sensible, and lets checkusers do their job of dealing with people abusing our sites without unnecessarily divulging personally identifying information to the checkusers.
If any interested parties have improvements to propose, let's hear them!
Thanks, Dan
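For readers following along, here is a minimal Python sketch of the originally proposed format, WikipediaApp/<version> <OS>/<form-factor>/<version>, using the Nexus 4 example from earlier in the thread. The function names and the regular expression are illustrative assumptions, not code from either app. It also shows the checkuser concern: two different devices on the same app and OS version produce byte-identical strings.

```python
import re

# Assumed pattern for the proposed format; only Phone|Tablet are allowed
# as form factors, as suggested in the original proposal.
UA_PATTERN = re.compile(
    r"^WikipediaApp/(?P<app_version>\S+) "
    r"(?P<os>[^/\s]+)/(?P<form_factor>Phone|Tablet)/(?P<os_version>\S+)$"
)


def build_ua(app_version, os_name, form_factor, os_version):
    """Build a User-Agent string in the proposed format (hypothetical helper)."""
    return f"WikipediaApp/{app_version} {os_name}/{form_factor}/{os_version}"


def parse_ua(ua):
    """Return the four components as a dict, or None if the UA does not match."""
    match = UA_PATTERN.match(ua)
    return match.groupdict() if match else None


# Two distinct devices with the same app and OS version are indistinguishable.
device_a = build_ua("32", "Android", "Phone", "4.4")
device_b = build_ua("32", "Android", "Phone", "4.4")
same_ua = device_a == device_b
```

A browser-style UA such as the Safari example quoted earlier would simply fail to parse here, which is one way a consumer could tell app traffic from browser traffic.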
On 27 March 2014 14:30, James Alexander jalexander@wikimedia.org wrote:
Speaking in my capacity as both a long-term volunteer checkuser (though not currently, because of my work requirements) and a very active work-related owner/user of checkuser in the LCA team [probably the most active within staff], and as a strong advocate for saving as little info as possible, I think your proposed adjustment makes sense.
That (assuming it's done for all logged actions as you suggest) seems like it would fit into the CU requirements well while saving as little information as needed on readers.
You say that the 2nd 'edit' user agent will be sent as a separate header. I imagine that would still be recorded in the read logs, then; is it just that it wouldn't be saved long term after the logs are processed in some way to remove other headers? [That would make sense to me, but if it's going to be kept in the logs as long as the user agent in the first place, I don't know why we wouldn't just switch what was being sent 'as' the user agent].
James
James Alexander Legal and Community Advocacy Wikimedia Foundation (415) 839-6885 x6716 @jamesofur
Actually, this discussion is a fork of the other discussion, with this fork being aimed at "What data should the Mobile App provide to CheckUser". It only relates to EventLogging and page views incidentally, in that they use user agents just like CheckUser.
Thanks, Dan
> We could generate a simple fingerprint that's unique per device (and disconnected completely from every other device identifier) that we send only with edits (and other 'POST' actions) as a separate header. This can be processed by CU (perhaps with a hook that Extension:MobileApp can hook into) and then used by CheckUsers. This data will be treated with the same data retention / privacy policy that applies to CUs now, and regular UA data can be consumed by other consumers without too much fingerprinting concerns.
Thanks for forking the thread for clarification. I was about to reply to the other thread with a scheme similar to this one. Given that CU data is thrown away after 90 days (as the rest of the operational data should be), this eases any long-term fingerprinting concerns.
It also highlights the notion that a mobile app is not a website, and that in that case the "user agent" is a made-up construct. And, as you said, if the CU extension needs to work for mobile requests, it cannot rely on IPs either.
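To make the scheme concrete, here is a rough Python sketch of the app side, under some loud assumptions: the header name X-App-Install-Id and both function names are hypothetical, and nothing here reflects what the apps or Extension:CheckUser actually implement. The identifier is a random UUID, so it is not derived from any other device identifier, and it is attached only to POST requests so that reads carry just the coarse user agent.

```python
import uuid

# Hypothetical header name, chosen only for illustration.
APP_INSTALL_HEADER = "X-App-Install-Id"


def new_install_id():
    """Generate a random per-install ID, unrelated to any hardware or OS identifier."""
    return uuid.uuid4().hex


def build_headers(method, user_agent, install_id):
    """Return request headers; the install ID is added only for POST actions."""
    headers = {"User-Agent": user_agent}
    if method.upper() == "POST":
        headers[APP_INSTALL_HEADER] = install_id
    return headers


# The ID would be generated once and stored by the app, then reused.
install_id = new_install_id()
read_headers = build_headers("GET", "WikipediaApp/32 Android/Phone/4.4", install_id)
edit_headers = build_headers("POST", "WikipediaApp/32 Android/Phone/4.4", install_id)
```

On the server side, the idea would be that only the CheckUser path reads this header and stores it under the existing CU retention rules, while analytics consumers never see it.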
On 27 March 2014 08:16, Nuria Ruiz nuria@wikimedia.org wrote:
Rather than having an ethical debate over it, we could always test the actual usefulness with Science. That way we'd be able to see how much granularity each additional component adds to the data.
I kind of feel we are going backwards as we throughly discussed this point, technical info and references regarding entropy and user agents
and
fingerprinting can be found here: https://www.mediawiki.org/wiki/EventLogging/UserAgentSanitization
On Thu, Mar 27, 2014 at 3:49 PM, Oliver Keyes okeyes@wikimedia.org wrote:
+1. I'm totally down for keeping less information around, but if it
gets
in the way of people doing their job?
Rather than having an ethical debate over it, we could always test the actual usefulness with Science. That way we'd be able to see how much granularity each additional component adds to the data.
On 27 March 2014 07:15, Aaron Halfaker ahalfaker@wikimedia.org
wrote:
Including more information on the UA, while being covered by legal under the new privacy policy, really goes agains the wishes of the
community
as they do not wish to be finger printed.
I don't think that "the wishes of the community" have been established and the whole point of checkuser is that it allows for fingerprinting.
On Thu, Mar 27, 2014 at 4:20 AM, Nuria Ruiz nuria@wikimedia.org
wrote:
>As a checkuser, user agents are an important part of my workflow for > identifying that multiple accounts are owned by the same person. > So I'm going to have to argue for including more information in the > user agent.
Including more information on the UA, while being covered by legal under the new privacy policy, really goes agains the wishes of the
community
as they do not wish to be finger printed. See:
https://www.mediawiki.org/wiki/Talk:EventLogging/UserAgentSanitization or
https://meta.wikimedia.org/wiki/Talk:Privacy_policy There has been plenty more discussions about this on analytics e-mail list.
>Your proposed user agent would basically mean that every single
person
> using the most up-to-date version of the app on a particular
platform would
> >be indistinguishable from each other. This would, unfortunately,
lead to
> lots of innocent users getting blocked as sockpuppets.
However, note that the UA " WikipediaApp/<version> <OS>/<form-factor>/<version>" clearly satisfies the use case of the
mobile
team. It provides as much information as they need from their user
without
sending any private data.
Can you please list what is your use case? Namely how are you identifying "false" accounts. Perhaps relying on the user agent to
do so is
not the best strategy going forward. Have in mind that with the old
privacy
policy UA data needed to be discarded after 90 days. With the new
policy
there is more legal room but given community feedback analytics team
is
planning on aggregating all UA information in the future. This means
that UA
data will not be stored (or reported) per user or request but rather agreggated (as in "4% of users use iPhone").
We recently gathered information from all teams on use cases pertaining to UA data collection:
https://office.wikimedia.org/wiki/Analytics/Internal/EventLogging/PrivateDat... .
Let's talk about your use case and add it to the document that already exists describing usages of user agent data. This document was sent out to all teams a couple of months ago, but there is no description of your use case there:
https://docs.google.com/a/wikimedia.org/document/d/1bp6qrvYi0Mh7l0s1psGnXEEN...
On Wed, Mar 26, 2014 at 11:20 PM, Dan Garry dgarry@wikimedia.org wrote:
>
> Hey Yuvi,
>
> As a checkuser, user agents are an important part of my workflow for identifying that multiple accounts are owned by the same person. So I'm going to have to argue for including more information in the user agent. Your proposed user agent would basically mean that every single person using the most up-to-date version of the app on a particular platform would be indistinguishable from each other. This would, unfortunately, lead to lots of innocent users getting blocked as sockpuppets.
>
> Here's an example of a user agent from an iPhone using Safari:
> Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_3_1 like Mac OS X; zh-tw) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8G4 Safari/6533.18.5
>
> Look at all of that wonderful information! ;-) In general, the more information you can include without breaching the user's privacy, the better.
>
> I'd be happy to work with you on this.
>
> Thanks,
> Dan
>
> P.S. You may also want to consult with the legal team, to ensure that unacceptable levels of private information are not given out. They would also make a complement for me; I would likely be pulling in the direction of "MOAR INFORMATION!", whereas they would likely be pulling in the direction of "LESS INFORMATION!". :-)
>
>
> On 26 March 2014 15:00, Yuvi Panda yuvipanda@gmail.com wrote:
>>
>> Add Analytics to cc, as I think they'll be interested as well :)
>>
>> On Thu, Mar 27, 2014 at 3:20 AM, Yuvi Panda yuvipanda@gmail.com wrote:
>> > Hello!
>> >
>> > We are getting closer to a general release of the Wikipedia Android and iOS apps, and I think we should standardize on a User-Agent format. The old app just appended an identifier in front of the phone's default UA[1] but I think we can do better, to avoid privacy concerns[2].
>> >
>> > How about:
>> >
>> > WikipediaApp/<version> <OS>/<form-factor>/<version>
>> >
>> > This gives us all the info we need (App version, OS, Form Factor (Tablet / Phone) and OS version) without giving away too much. It is also fairly simple to construct and parse.
>> >
>> > For the latest alpha, my Nexus 4 would generate
>> >
>> > WikipediaApp/32 Android/Phone/4.4
>> >
>> > While an iOS device might generate
>> >
>> > WikipediaApp/2.0 iOS/Phone/7.1
>> >
>> > form-factor would just be Phone|Tablet for now, and can be expanded later if necessary.
>> >
>> > Thoughts?
>> >
>> > [1]: https://www.mediawiki.org/wiki/Mobile/User_agents#Apps
>> > [2]: https://www.mediawiki.org/wiki/EventLogging/UserAgentSanitization
>> > --
>> > Yuvi Panda T
>> > http://yuvi.in/blog
>>
>> --
>> Yuvi Panda T
>> http://yuvi.in/blog
>>
>> _______________________________________________
>> Analytics mailing list
>> Analytics@lists.wikimedia.org
>> https://lists.wikimedia.org/mailman/listinfo/analytics
>
> --
> Dan Garry
> Associate Product Manager for Platform
> Wikimedia Foundation
>
> _______________________________________________
> Analytics mailing list
> Analytics@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics
Analytics mailing list Analytics@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/analytics
-- Oliver Keyes Research Analyst Wikimedia Foundation
-- Yuvi Panda T http://yuvi.in/blog