Hey there!
Could someone from Analytics clarify the purging schedule for geoeditors_daily and add it on Wikitech https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Edits/Geoeditors#Generation? I've added some information based on my experience using the dataset, but it may not be fully accurate.
I wrote: *Because these tables contain the countries of individual editors, we only keep the data corresponding to the two most recent full months (the month of the latest mediawiki_history snapshot and the previous).*
That's right, Neil, I just changed the language around a bit, thanks for updating that!
On Tue, Nov 20, 2018 at 3:26 PM Neil Patel Quinn nquinn@wikimedia.org wrote:
Hey there!
Could someone from Analytics clarify the purging schedule for geoeditors_daily and add it on Wikitech https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Edits/Geoeditors#Generation? I've added some information based on my experience using the dataset, but it may not be fully accurate.
I wrote: *Because these tables contain the countries of individual editors, we only keep the data corresponding to the two most recent full months (the month of the latest mediawiki_history snapshot and the previous).*
_______________________________________________
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics
Thanks, Dan! I just expanded it even more (correcting my earlier error confusing the privacy policy and data retention guidelines). See what you think of this:
*Since this raw data identifies the location of individual editors, we keep it for only 90 days, in accordance with our data retention guidelines https://meta.wikimedia.org/wiki/Data_retention_guidelines. Data older than 90 days is continuously purged from the source cu_changes table, but since we regenerate the Data Lake's editing data every month https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster#Edit_data, we instead keep data in mediawiki_private_cu_changes and geoeditors_daily for the two latest calendar months (the month of the latest mediawiki_history https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Edits/Mediawiki_history snapshot and the previous). Older data may be temporarily available before it is purged, but you should not rely on this.*
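To make the two-month rule concrete, here is a minimal sketch of how one might work out which calendar months are kept for a given snapshot. It assumes the mediawiki_history snapshot is labelled by month as a YYYY-MM string; the function names retained_months and is_retained are hypothetical and are not part of any Analytics tooling.

```python
from datetime import date


def retained_months(snapshot):
    """Return the two calendar months (as YYYY-MM strings) kept for a snapshot.

    Assumption: `snapshot` is a month label such as "2018-10".
    """
    year, month = (int(part) for part in snapshot.split("-"))
    # Step back one month, wrapping from January to the previous December.
    if month == 1:
        prev_year, prev_month = year - 1, 12
    else:
        prev_year, prev_month = year, month - 1
    return (
        f"{prev_year:04d}-{prev_month:02d}",
        f"{year:04d}-{month:02d}",
    )


def is_retained(day, snapshot):
    """Check whether a given day falls inside the two retained months."""
    return day.strftime("%Y-%m") in retained_months(snapshot)


if __name__ == "__main__":
    print(retained_months("2018-10"))
    print(is_retained(date(2018, 8, 31), "2018-10"))
```

With a snapshot of 2018-10, this treats September and October 2018 as retained and anything from August or earlier as purgeable, which matches the "should not rely on this" caveat for older data.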
On Tue, 20 Nov 2018 at 12:43, Dan Andreescu dandreescu@wikimedia.org wrote:
That's right, Neil, I just changed the language around a bit, thanks for updating that!
On Wed, 21 Nov 2018 at 11:39, Neil Patel Quinn nquinn@wikimedia.org wrote:
(correcting my earlier error confusing the privacy policy and data retention guidelines)
Well, it looks like I only *thought* I included that error. Still, now that I've figured out the distinction, this might be a useful reminder for others too: in the privacy policy https://meta.wikimedia.org/wiki/Privacy_policy we commit to retaining personal data "for the shortest possible time that is consistent with the maintenance, understanding, and improvement of the Wikimedia Sites, and our obligations under applicable U.S. law." The specific period of 90 days is just our current implementation of that commitment, which we document in the data retention guidelines https://meta.wikimedia.org/wiki/Data_retention_guidelines.
That's good, Neil. In general, though, be careful with any public releases of this particular table; it's more sensitive than recentchanges.
On Wed, Nov 21, 2018 at 2:45 PM Neil Patel Quinn nquinn@wikimedia.org wrote:
Well, it looks like I only *thought* I included that error. Still, now that I've figured out the distinction, this might be a useful reminder for others too: in the privacy policy https://meta.wikimedia.org/wiki/Privacy_policy we commit to retaining personal data "for the shortest possible time that is consistent with the maintenance, understanding, and improvement of the Wikimedia Sites, and our obligations under applicable U.S. law." The specific period of 90 days is just our current implementation of that commitment, which we document in the data retention guidelines https://meta.wikimedia.org/wiki/Data_retention_guidelines.