On Wed, Jan 14, 2015 at 2:25 PM, Oliver Keyes ironholds@gmail.com wrote:
I'm confused; john, could you point to the element of the collected data that isn't collected already by default in any Nginx or Apache setup? I agree that there might be a lack of user expectation, but 'silently capturing behavioral data' seems somewhat hyperbolic to describe what's actually going on.
The proposed element to be added is geolocation below country level. Default Nginx and Apache log formats do not include geolocation, which is why this research proposal exists and is being discussed, and rightly so.
fwiw, the Nginx geoip module is not even compiled in by default; it has to be enabled explicitly (with --with-http_geoip_module) when building from source.
As the paper explicitly describes, and as is a common theme in research proposals, Wikimedia access logs are a capture of user reading behaviour.
The old privacy and data retention policies gave users the expectation that access log data was destroyed after a set period, assumed to be only three months as that was the limit of CheckUser visibility. The current policies are more like "yes, we collect a lot of data about users, using tracking technology; please trust us", and "sorry, we don't honour 'Do Not Track', as we presume that you trust us and the researchers we allow to access our analytics."
We should be planning for the day the WMF servers are hacked and _all_ of the analytics data ends up in the hands of a repressive government or similar. Or imagine the WMF sending the analytics data across an insecure link that is tapped and reconstructed, whether because secure links were not used at all or because of an accidental routing problem. https://lists.wikimedia.org/pipermail/wikimedia-l/2013-December/129357.html
If/When that day comes, hopefully they don't have much data to make inferences from, and what data they obtain can be well justified.
Having a quick peek, I thought it was odd that browsing Wikimedia sites now causes impressions to be sent back to the WMF servers with the country of the user included. "This is a workaround to simplify analytics." https://git.wikimedia.org/blob/mediawiki%2Fextensions%2FCentralNotice/8ee877...
The more you collect, especially using multiple systems to collect similar data, the more likely it is that, if subpoenaed, the WMF's various datasets could be combined to infer a fairly reliable answer to questions like "which days in 2013 was John Vandenberg in Indonesia?" or "when did John Vandenberg first read the Wikipedia article about <bomb making ingredient>?". The more you publish, even in aggregate, the more likely such questions can be answered without a subpoena, at least for users with a large enough list of public contributions, by scientists like yourself with ample computation power and plenty of time to rifle through the data and *infer* the identity of editors; a government body would also have many other datasets to assist in the task.
Adding fine-grained geolocation to published page views is an example of the latter, and the paper wisely suggests excluding logged-in users as a possible solution to some of the privacy issues.
There is also the problem that, in some situations, many IPs can easily be inferred to be a single cohort of people, e.g. in regions where the only large collection of computers is a single facility, such as a school. In a repressive regime especially, that could lead to official questions being asked, like: why were so many students at this school reading about <blah> on <date>? And teachers being identified as responsible, etc.
The paper treats IP users and logged-in users as a binary split. However, tools already exist which exploit the fact that logged-in users occasionally make a logged-out edit that reveals their IP. Add geolocation of page views, and we can infer the probability that other IPs in their smallest geolocation block are also the same person, because the algorithm in the paper leaks the number of active editors in each region each day.
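To make that risk concrete, here is a hypothetical sketch (all function names and numbers are my own invention, not from the paper) of how a leaked per-region daily active-editor count shrinks the anonymity set once a single logged-out edit ties a known editor to a geolocation block:

```python
# Hypothetical sketch: how small per-region active-editor counts
# shrink the anonymity set.  All data and names here are invented.

def attribution_probability(active_editors_in_region: int) -> float:
    """Naive baseline probability that an arbitrary edit from a region
    on a given day belongs to one specific known editor, assuming the
    region's editors are otherwise indistinguishable."""
    if active_editors_in_region <= 0:
        raise ValueError("need at least one active editor")
    return 1.0 / active_editors_in_region

# One logged-out edit reveals the editor's IP, hence their smallest
# geolocation block.  If the published aggregate says only two editors
# were active in that block that day, every other edit from the block
# is attributable to the target with probability 0.5.
print(attribution_probability(2))   # 0.5
print(attribution_probability(50))  # 0.02
```

A real attacker would combine this baseline with timing and topic overlap, so 1/n is a floor, not a ceiling.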
The purpose of this proposed change in analytics is summarised in the paper:
'In short, the current global aggregation of Wikipedia page view is unsuitable for an operational disease monitoring system. There will be no “Wikipedia Flu Trends” unless page view data are aggregated at a finer geographic scale.'
If "Wikipedia Flu Trends" is the justification, we had better be certain that detecting Flu Trends using Wikipedia is going to be the most effective method, and isn't just an academically interesting exercise. A limited trial to determine utility would be helpful to establish if "Wikipedia Flu Trends" is a viable world health solution worthy of justifying additional data retention and publishing of aggregates.
Is there a minimum threshold of views at which a page becomes 'interesting' to analyse using finer-grained geographic data? I suspect that pages with only hundreds of page views per day are not particularly useful for "Wikipedia Flu Trends".
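One way such a threshold could be operationalised, sketched here with an invented cutoff K and invented sample counts, is to suppress any (page, region) cell whose daily view count falls below the minimum before the aggregate is published:

```python
# Hypothetical sketch: suppress low-count cells before publishing
# geographically aggregated page views.  The threshold K and the
# sample data below are invented for illustration only.

K = 100  # minimum daily views for a (page, region) cell to be published

daily_views = {
    ("Influenza", "region-A"): 4213,
    ("Influenza", "region-B"): 37,    # below K: suppressed, not published
    ("Oseltamivir", "region-A"): 512,
}

published = {cell: views for cell, views in daily_views.items() if views >= K}
print(published)
```

Small cells are exactly the ones most useful for re-identification and least useful for trend detection, so a cutoff like this costs the flu model little.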
Also does "Wikipedia Flu Trends" need to have access to geographically tagged page view data of, say, me reading http://en.wiktionary.org/wiki/bota today? Is there a way to restrict which types of pages are tracked at finer geographic granularity without adversely affecting the "Wikipedia Flu Trends" graph.
-- John Vandenberg