Hi Nuria,

OK, so the user-agent data for edits is stored in a different database, is heavily sampled when used for research, and will still be accessible for CU use if user_agent_map is removed from the pageview_hourly data, right?

On Mon, Sep 28, 2015 at 10:48 AM, Nuria Ruiz <nuria@wikimedia.org> wrote:
Pine:

The pageview_hourly dataset on Hive contains pageviews, not edits.

The majority of data for edits is not associated with a user agent, as it is stored in the MediaWiki database. Some of it comes via EventLogging, as experiments are run in, for example, VisualEditor. This second source of data is of a very different nature from the one we just ran this test on: it is heavily sampled, not public, and will be purged every 90 days.
https://wikitech.wikimedia.org/wiki/Analytics/EventLogging#Data_retention_and_auto-purging


Thanks, 

Nuria

On Mon, Sep 28, 2015 at 7:23 AM, Pine W <wiki.pine@gmail.com> wrote:

Hi Nuria,

Thanks for working on this.

Removing user_agent_map would be only for readership data, correct? Would this data still be stored for edits, and if so, for how long?

Pine

On Sep 28, 2015 7:16 AM, "Nuria Ruiz" <nuria@wikimedia.org> wrote:
Hello, 

We have been working on the exercise of reconstructing an identity using the (still private) pageview_hourly dataset (https://wikitech.wikimedia.org/wiki/Analytics/Data/Pageview_hourly)

TL;DR
It is possible (and easy) to do that with the fields the dataset has now. Before releasing it publicly, we need to anonymize it further.

More info here:

Thanks, 

Nuria

_______________________________________________
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics

