I emailed mobile-l and wikitech-l about this, and now I'm moving the discussion to wikimedia-l. Here's the longer technical thread:
http://lists.wikimedia.org/pipermail/mobile-l/2014-April/006884.html
In summary, to show Wikipedia Zero banners on the correct mobile networks, we plan to log two pieces of data once per cellular-based app session, in a specialized logfile, and to delete log entries older than 90 days:
1. MCC-MNC code (http://en.wikipedia.org/wiki/Mobile_country_code; format is ###-##), which denotes the mobile operator
2. Exit (gateway/proxy) IP address
* These data points would not be logged alongside the normal web access logs.
This information could be used to estimate rough demand for Wikipedia in potential Wikipedia Zero geographies, although the primary goal is remediating the out-of-sync IP addresses on file for existing partners.
Internal review suggests this is in alignment with the privacy policy, and we wanted to see if there were other thoughts on this approach here on wikimedia-l.
-Adam
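To make the MCC-MNC format from the summary above concrete, here is a minimal Python sketch that validates and splits such a code. The function name `parse_mcc_mnc` and the pattern are illustrative assumptions, not code from the apps or the server. The pattern also accepts a three-digit MNC, since some operators use ###-### rather than ###-##.

```python
import re

# Assumed pattern: three-digit MCC, a hyphen, then a two- or three-digit MNC.
MCC_MNC_PATTERN = re.compile(r"([0-9]{3})-([0-9]{2,3})")


def parse_mcc_mnc(value):
    """Return (mcc, mnc) as strings, or None if the value is malformed."""
    match = MCC_MNC_PATTERN.fullmatch(value.strip())
    if match is None:
        return None
    # Keep the parts as strings so leading zeros (e.g. MNC "01") survive.
    return match.group(1), match.group(2)
```

Keeping both parts as strings avoids silently turning an MNC such as "01" into the number 1.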
Adam Baso wrote:
> [...]
Thanks for starting this thread.
Sorry if I've overlooked this, but who/what will have access to this data? Only members of the mobile team? Local project CheckUsers? Wikimedia Foundation-approved researchers? Wikimedia shell users? AbuseFilter filters?
And this may be a silly question, but is there a reasonable means of approximating how identifying these two data points alone are? That is, using a mobile country code and exit IP address, is it possible to identify a particular editor or reader? Or perhaps rephrased, is this data considered anonymized?
MZMcBride
Inline.
> Thanks for starting this thread.
> Sorry if I've overlooked this, but who/what will have access to this data? Only members of the mobile team? Local project CheckUsers? Wikimedia Foundation-approved researchers? Wikimedia shell users? AbuseFilter filters?
It's a good question. The thought is to put it in the customary wfDebugLog location (with, for example, filename "mccmnc.log") on fluorine.
It just occurred to me that the wiki name (e.g., "enwiki"), but not the full URL, also gets logged as part of the wfDebugLog call. To make the implicit explicit, wfDebugLog adds a datetime stamp as well, which is useful for purging old records. I'll forward this email to mobile-l and wikitech-l to underscore this.
> And this may be a silly question, but is there a reasonable means of approximating how identifying these two data points alone are? That is, using a mobile country code and exit IP address, is it possible to identify a particular editor or reader? Or perhaps rephrased, is this data considered anonymized?
Not a silly question. My approximation is that these tuples alone (datetime, now that it has hit me, plus XYwiki, exit IP, and MCC-MNC), although not perfectly anonymized, are low-identifying. Indirect inferences on the data in isolation are unlikely but technically possible, through examination of short-tail outliers in a cluster analysis, where such readers/editors fall into the short-tail outlier sets. This contrasts with regular web access logs, where direct inferences are easy.
Thanks. I'll forward this along now.
-Adam
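One rough way to approximate how identifying these tuples are is to count how many log entries share each (wiki, exit IP, MCC-MNC) combination and then look at the combinations that occur only a few times. The sketch below is a simple group-size count in Python, not the cluster analysis mentioned above. The record field names and the threshold of 5 are hypothetical.

```python
from collections import Counter


def rare_groups(records, threshold=5):
    """Return the (wiki, exit_ip, mcc_mnc) combinations seen fewer than `threshold` times."""
    counts = Counter((r["wiki"], r["exit_ip"], r["mcc_mnc"]) for r in records)
    # Small groups are the ones where an individual reader or editor could stand out.
    return {key: n for key, n in counts.items() if n < threshold}
```

A combination that appears only once or twice is where an indirect inference about a single person is most plausible, so the size of this result gives a crude sense of the exposure.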
Hi Adam,
One thought: you don't really need the date/time data at any detailed resolution, do you? If what you're wanting it for is to track major changes ("last month it all switched to this IP") and to purge old data ("delete anything older than 10 March"), you could simply log day rather than datetime.
enwiki / 127.0.0.1 / 123.45 / 2014-04-16:1245.45
enwiki / 127.0.0.1 / 123.45 / 2014-04-16
- the latter gives you the data you need while making it a lot harder to do any kind of close user-identification.
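The day-resolution idea can be sketched in Python as follows. This is only an illustration under assumptions: the `log_line` and `purge_old` helpers and the slash-separated line layout mirror the two example lines above and are not the actual wfDebugLog format.

```python
from datetime import date, datetime, timedelta


def log_line(wiki, exit_ip, mcc_mnc, when):
    """Format one entry with the timestamp truncated to the day."""
    return f"{wiki} / {exit_ip} / {mcc_mnc} / {when.date().isoformat()}"


def purge_old(lines, today, max_age_days=90):
    """Keep only entries whose day stamp is within `max_age_days` of `today`."""
    cutoff = today - timedelta(days=max_age_days)
    kept = []
    for line in lines:
        # The day stamp is the last slash-separated field of the line.
        day = date.fromisoformat(line.rsplit(" / ", 1)[1])
        if day >= cutoff:
            kept.append(line)
    return kept


if __name__ == "__main__":
    entry = log_line("enwiki", "127.0.0.1", "123.45", datetime(2014, 4, 16, 12, 45, 45))
    print(entry)
    print(purge_old([entry], today=date(2014, 7, 20)))
```

A day stamp is still enough both to notice a change of exit IP from one month to the next and to decide which entries to delete.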
Andrew.
On 16 Apr 2014 19:17, "Adam Baso" <abaso@wikimedia.org> wrote:
> [...]
_______________________________________________
Wikimedia-l mailing list
Wikimedia-l@lists.wikimedia.org
Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l, mailto:wikimedia-l-request@lists.wikimedia.org?subject=unsubscribe
Great idea!
Anyone on the list know if there's a way to make the debug log facilities do the YYYYMMDD timestamp instead of the longer one?
If not, I suppose we could work to update the core MediaWiki code. [1]
-Adam
[1] For those with PHP skills or equivalent, I'm referring to https://git.wikimedia.org/blob/mediawiki%2Fcore.git/a26687e81532def3faba6461.... Scroll to the bottom of the function definition to see the datetimestamp approach.
On Wed, Apr 16, 2014 at 12:47 PM, Andrew Gray <andrew.gray@dunelm.org.uk> wrote:
> [...]
After examining this, it looks like EventLogging is better suited to the logging task than debug logging, which would have required altering debug logging in the core MediaWiki software.
EventLogging logs at a resolution of one second (instead of one day), but it has built-in support for record removal after 90 days.
Please do let us know in case of further questions. Here's the logging schema for those with an interest:
https://meta.wikimedia.org/wiki/Schema:MobileOperatorCode
Here's the relevant server code:
https://gerrit.wikimedia.org/r/#/c/130991/
-Adam
On Wed, Apr 16, 2014 at 2:20 PM, Adam Baso <abaso@wikimedia.org> wrote:
> [...]
Federico asked if sampling might make sense here. I think it will work, so I've updated the patchset.
From a patchset comment I provided:
"It's possible we may have situations where operators have not lots of users on them accessing Wiki(m|p)edia properties, so we do run some risk of actually missing IPs, even if exit IPs are concentrators of typically large sets of users. That said, let's try a 2% sample ratio; and if we find out it's insufficient, then we'll sample more, if it's oversampling, then we can adjust the other way, too. New patchset arriving shortly."
(I've since submitted the updated code for review.)
-Adam
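The trade-off in the patchset comment above can be illustrated with a short Python sketch. It assumes each session is sampled independently at random, which is an assumption about the server-side behaviour rather than something confirmed in this thread. The names `should_log` and `miss_probability` are hypothetical.

```python
import random

SAMPLE_RATIO = 0.02  # the 2% ratio proposed in the patchset comment


def should_log(rng=random):
    """Decide independently for each session whether its event is logged."""
    return rng.random() < SAMPLE_RATIO


def miss_probability(sessions, ratio=SAMPLE_RATIO):
    """Chance that none of `sessions` independent sessions from one exit IP is sampled."""
    return (1.0 - ratio) ** sessions
```

Under these assumptions, an exit IP that carries 100 sessions in the retention window would go unsampled roughly 13% of the time, which is the risk of missing IPs for operators with few users.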
On Thu, May 1, 2014 at 7:52 PM, Adam Baso <abaso@wikimedia.org> wrote:
> [...]
Okay, the code is in place in the alphas of both the Android and iOS apps, and the server-side 2% sampling (extra header in HTTPS request sent once per cellular app session) is working.
https://git.wikimedia.org/commitdiff/apps%2Fandroid%2Fwikipedia.git/8b4a0c3b...
https://git.wikimedia.org/commitdiff/apps%2Fios%2Fwikipedia.git/59cde497921b...
https://git.wikimedia.org/commitdiff/mediawiki%2Fextensions%2FZeroRatedMobil...
Changes to event logging in the iOS alpha app (internal only at the moment, although the repo can be cloned and run in the Xcode simulator) are coming pretty soon. Once those are in, we'll make one last tweak there so that the app does not add the extra MCC/MNC header on that single request per cellular connection when logging is turned off. That part is already done in the Android app.
-Adam
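The client-side behaviour described above (one extra header on a single request per cellular session, and none when logging is turned off) could look roughly like this. It is a Python sketch for illustration only: the apps are not written in Python, and the class name and header name are hypothetical.

```python
class CellularSession:
    """Tracks whether the MCC-MNC header has already been sent in this session."""

    HEADER_NAME = "X-Example-MCC-MNC"  # hypothetical header name

    def __init__(self, mcc_mnc, logging_enabled=True):
        self.mcc_mnc = mcc_mnc
        self.logging_enabled = logging_enabled
        self._header_sent = False

    def extra_headers(self):
        """Return the extra header for the first request only, and only if logging is on."""
        if not self.logging_enabled or self._header_sent:
            return {}
        self._header_sent = True
        return {self.HEADER_NAME: self.mcc_mnc}
```

A new `CellularSession` would be created whenever the device starts a new cellular connection, so the header goes out at most once per session.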
On Fri, May 2, 2014 at 1:16 PM, Adam Baso <abaso@wikimedia.org> wrote:
> [...]
One wrinkle we've encountered, and sort of expected, is that the SIM card MCC-MNC doesn't always match the actual network MCC-MNC. So on Android, we'll add both to the payload so that we can differentiate them. On iOS, it looks like the API currently allows only one of these values, through an opaque method call. The previous EventLogging server-side code wasn't logging the User-Agent (defined coarsely in our code on both platforms). To make it evident when we're dealing with an iOS version of the app, I think it would make most sense to re-enable User-Agent logging so we can pick up this coarse-grained value. I wanted to put this User-Agent item out here for a brief period before adding the code, though.
-Adam
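A payload that keeps the SIM and network values apart might be shaped like the Python sketch below. The field names are hypothetical and are not taken from the MobileOperatorCode schema. Which of the two values iOS reports is not stated in this thread, so the sketch simply omits whichever field is unavailable.

```python
def build_payload(platform, sim_mcc_mnc=None, network_mcc_mnc=None):
    """Build an event payload that keeps SIM and network operator codes separate."""
    payload = {"platform": platform}
    if sim_mcc_mnc is not None:
        payload["sim_mcc_mnc"] = sim_mcc_mnc
    if network_mcc_mnc is not None:
        payload["network_mcc_mnc"] = network_mcc_mnc
    return payload


def codes_differ(payload):
    """True only when both codes are present and they do not match."""
    sim = payload.get("sim_mcc_mnc")
    network = payload.get("network_mcc_mnc")
    return sim is not None and network is not None and sim != network
```

With only one code present, a mismatch cannot be detected, which is why a coarse platform indicator such as the User-Agent helps when interpreting the logged value.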
On Fri, May 30, 2014 at 2:04 PM, Adam Baso <abaso@wikimedia.org> wrote:
> [...]
Here's the patch update.
https://gerrit.wikimedia.org/r/141740
On Mon, Jun 23, 2014 at 3:30 PM, Adam Baso <abaso@wikimedia.org> wrote:
> [...]