Not to hijack the thread, but: to do this in the schema itself confuses
the structure of the data with the mechanics of its use. I think having a
couple of helper functions for simple random sampling is sufficient.
Much agreed with Ori here. We would be bloating the schema with
properties that have nothing to do with the data definition.
Note that – per our data retention guidelines – not
all EL data is expected to be automatically purged within 90 days (see the section on
“Non-personal information associated with a user account”).
I certainly think we
should keep performance data (like navigation
timing) for longer than 90 days, removing pageId and userId if needed.
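To illustrate what that could look like in practice, here is a minimal sketch of the sanitization step, run against an in-memory SQLite database; the table name and schema are hypothetical, not the actual EL tables:

```python
import sqlite3

def sanitize_old_rows(conn, table, cutoff_ts):
    """NULL out pageId and userId for rows older than cutoff_ts,
    keeping the performance metrics themselves."""
    conn.execute(
        f"UPDATE {table} SET pageId = NULL, userId = NULL WHERE timestamp < ?",
        (cutoff_ts,),
    )
    conn.commit()

# Demo with a made-up navigation-timing table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE NavTiming (timestamp TEXT, pageId INT, userId INT, loadMs INT)")
conn.execute("INSERT INTO NavTiming VALUES ('20140101000000', 1, 42, 350)")
conn.execute("INSERT INTO NavTiming VALUES ('20140520000000', 2, 43, 410)")
sanitize_old_rows(conn, "NavTiming", "20140301000000")
rows = conn.execute("SELECT pageId, userId, loadMs FROM NavTiming ORDER BY timestamp").fetchall()
# The old row keeps its timing metric but loses its identifiers;
# the recent row is untouched.
```

The point being that retention past 90 days and removal of identifiers are independent operations on the stored rows, neither of which needs to live in the schema.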
On Wed, May 21, 2014 at 9:03 AM, Ori Livneh <ori(a)wikimedia.org> wrote:
On Tue, May 20, 2014 at 10:36 PM, Dario Taraborelli wrote:
On May 20, 2014, at 10:09 PM, Sean Pringle <springle(a)wikimedia.org> wrote:
I'd like to hear from stakeholders about purging old data from the
eventlogging database. Yes, no, why [not], etc.
I understand from Ori that there is a 90 day retention policy, and that
purging has been discussed previously but not addressed for various reasons.
Certainly there are many timestamps older than 90 days still in the db, and
apparently largely untouched by queries?
Perhaps we're in a better position now to do this properly, what with the data
now in multiple places: log files, database, Hadoop...
Can we please purge stuff? :-)
I sent a similar proposal to the internal list for preliminary feedback
(see item 2 below).
All, I wanted to hear your thoughts informally (before posting to the
lists) on two ideas that have been floating around recently:
1) add support for optional sampling in EventLogging via JSON schemas
(given the sheer number of teams who have asked for it). See
Not to hijack the thread, but: to do this in the schema itself confuses the
structure of the data with the mechanics of its use. I think having a couple
of helper functions for simple random sampling is sufficient.
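A helper along those lines could be as small as the sketch below; `log_event` and its signature are made up for illustration and are not the real EventLogging client API:

```python
import random

def in_sample(rate):
    """Simple random sampling: return True with probability `rate` (0.0-1.0).
    The sampling decision lives in instrumentation code, not in the schema."""
    return random.random() < rate

def log_event(schema, event, sample_rate=1.0):
    """Hypothetical wrapper around the real client: drop events that fall
    outside the sample before they are ever sent."""
    if not in_sample(sample_rate):
        return False
    # ... hand the event off to the actual EventLogging client here ...
    return True
```

A team that wants 10% sampling calls `log_event(schema, event, sample_rate=0.1)`; the schema itself stays a pure data definition.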
2) introduce 90-day pruning by default for all logs (adding a dedicated
schema element to override the default).
Same problem. To illustrate: suppose we're two months into a data collection
job. The researcher carelessly forgot to modify the pruning policy, so it's
set to the default 90 days, whereas they actually need 180. At this
point our options are:
1) Decline to help, even though there's a full month before the pruning
would take effect.
2) Somehow alter the schema revision without creating a new revision.
EventLogging assumes that schema revisions are immutable and it exploits
this property to provide guarantees about data validity and consistency, so
this is a nonstarter.
3) Create a new schema revision that declares a 180-day expiration and then
populate its table with a copy of each event logged under the previous
revision.
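For what it's worth, option (3) amounts to a straight table-to-table copy, since each schema revision gets its own table. A rough sketch with hypothetical table names, using an in-memory SQLite database as a stand-in for the real store:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical per-revision tables: one table per (schema, revision) pair.
conn.execute("CREATE TABLE MySchema_100 (timestamp TEXT, value INT)")
conn.execute("CREATE TABLE MySchema_101 (timestamp TEXT, value INT)")
conn.executemany("INSERT INTO MySchema_100 VALUES (?, ?)",
                 [("20140401000000", 1), ("20140501000000", 2)])

# Backfill the new revision's table from the old one, so that a policy
# attached to revision 101 would cover the existing events too.
conn.execute("INSERT INTO MySchema_101 SELECT * FROM MySchema_100")
count = conn.execute("SELECT COUNT(*) FROM MySchema_101").fetchone()[0]
```

Workable, but clumsy: every policy change forces a revision bump and a bulk copy, which is exactly the cost of coupling retention to the schema.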
The motivation behind your proposal is (I think) a desire to have a unified
configuration interface for data collection jobs. This makes total sense and
it's worth pursuing. I just don't think we should stuff everything into the
schema. The schema is just that: a schema. It's a data model.
This would push onto the customers the responsibility of ensuring that the
right data is collected and retained.
I understand 2) has already been partly implemented for the raw JSON logs
(not yet for EL data stored in SQL). Obviously, we would need to audit
existing logs to make sure that we don’t discard data that needs to be
retained in a sanitized or aggregate form past 90 days.
Note that – per our data retention guidelines – not all EL data is
expected to be automatically purged within 90 days (see the section on
“Non-personal information associated with a user account”): many of these
logs have a status similar to MediaWiki data that is retained in the DB but
not fully exposed to labs.
For this reason, I am proposing that we enable 90-day pruning by default
for new schemas, with the ability to override the default.
Sounds good to me. I figure that the overrides would be specified as
configuration values for the script that does the actual pruning. We could
Puppetize that and document the process for adding exemptions.
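A sketch of what that split could look like: the default lives in the pruning script, and exemptions are plain configuration (e.g. a puppetized file), not schema properties. All table names are hypothetical, and SQLite stands in for the real database:

```python
import sqlite3
from datetime import datetime, timedelta

DEFAULT_RETENTION_DAYS = 90
# Per-schema-table exemptions, maintained as configuration.
RETENTION_OVERRIDES = {"LongRunningStudy_777": 180}

def prune(conn, table, now):
    """Delete rows older than the table's retention window; return the count."""
    days = RETENTION_OVERRIDES.get(table, DEFAULT_RETENTION_DAYS)
    cutoff = (now - timedelta(days=days)).strftime("%Y%m%d%H%M%S")
    cur = conn.execute(f"DELETE FROM {table} WHERE timestamp < ?", (cutoff,))
    conn.commit()
    return cur.rowcount

# Demo: two tables, one covered by the default, one exempted.
conn = sqlite3.connect(":memory:")
for t in ("EdgeCounts_123", "LongRunningStudy_777"):
    conn.execute(f"CREATE TABLE {t} (timestamp TEXT, value INT)")
    conn.execute(f"INSERT INTO {t} VALUES ('20140101000000', 1)")
    conn.execute(f"INSERT INTO {t} VALUES ('20140301000000', 2)")

now = datetime(2014, 5, 21)
pruned_default = prune(conn, "EdgeCounts_123", now)         # 90-day default
pruned_override = prune(conn, "LongRunningStudy_777", now)  # 180-day override
```

In the 180-day scenario above, fixing a forgotten override then means editing one config line and deploying, with no schema revision churn.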
Existing schemas would need to be audited on a case by case basis.
By whom? :) Surely not Sean! It'd be great to get this process going.