Gotcha. Reading that proposal, it appears to be a proposal for a methodology that will enable future proposals; where are those future proposals? It also says "in many countries, disease monitoring must be carried out at the state or metro-area level" - which countries require metro-level monitoring? Who are we risking the entire reader population for, here? Is it one country, or ten, or more?
For what it's worth, I love the idea of this kind of live stream. But I want to make sure that how the various chunks are prioritised is correlated with how critical they are to the outside world, and with the sensitivity of the underlying data, at that. If we're introducing risks by going down to city level and the actual use cases for city-level data are limited, let's not do that - but this proposal doesn't say how limited those use cases are. It just says that city-level data is required in some countries.
On 5 June 2015 at 09:35, Dan Andreescu <dandreescu@wikimedia.org> wrote:
My only thought is that "city" makes me uncomfortable. Did we track down a precise use case for that in the end?
Yes, the Los Alamos National Lab folks' proposal: https://meta.wikimedia.org/wiki/Research:Geo-aggregation_of_Wikipedia_pagevi...
We talked to them yesterday and it seems the time granularity is not as important. That's why that dataset is *daily* and the other one is *hourly*. Either way, these will be k-anonymized at any level. Once we have some data up, though, I'd love for people who are good at this to try and attack the datasets in combination and from different points of view like t-closeness, etc. I don't want to leak any info and any help on that is appreciated 'cause it's a hard problem.
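To make the k-anonymization point concrete, here is a minimal sketch of one way a threshold could be applied to geo-aggregated counts. It is an illustration under stated assumptions, not the pipeline being discussed: the function name, the threshold of 10, and the (page, country, city) record shape are all hypothetical, and raw pageview counts stand in for distinct readers, which is a simplification.

```python
from collections import Counter

K = 10  # hypothetical threshold; the real value is not stated in this thread


def k_anonymize(views, k=K):
    """Aggregate (page, country, city) pageviews and hide small buckets.

    City-level buckets with fewer than k views are folded into a
    country-level remainder bucket keyed (page, country, None).
    Remainder buckets that are still below k are dropped entirely.
    """
    city_counts = Counter(views)
    released = {}
    for (page, country, city), n in city_counts.items():
        if n >= k:
            released[(page, country, city)] = n
        else:
            # Too few views to publish at city level: roll up to the country.
            key = (page, country, None)
            released[key] = released.get(key, 0) + n
    # A rolled-up remainder can itself be too small to publish.
    return {key: n for key, n in released.items() if n >= k}


views = (
    [("Influenza", "US", "Los Alamos")] * 12
    + [("Influenza", "US", "Smallville")] * 3
    + [("Influenza", "US", "Tinytown")] * 8
    + [("Dengue", "BR", "Smallburg")] * 2
)
result = k_anonymize(views)
```

Note that a threshold like this does not, on its own, protect against combining the daily and hourly datasets or differencing against other published totals, which is the kind of attack worth trying once data is up.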
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics