Re: [Analytics] purging old data from eventlogging db

21 May 2014

On Tue, May 20, 2014 at 10:36 PM, Dario Taraborelli <dtaraborelli(a)wikimedia.org> wrote:

...
  On May 20, 2014, at 10:09 PM, Sean Pringle <springle(a)wikimedia.org> wrote:

 Hi!

 I'd like to hear from stakeholders about purging old data from the
 eventlogging database. Yes, no, why [not], etc.

 I understand from Ori that there is a 90 day retention policy, and that
 purging has been discussed previously but not addressed for various
 reasons. Certainly there are many timestamps older than 90 days still in
 the db, and apparently largely untouched by queries?

 Perhaps we're in a better position now to do this properly what with data
 now in multiple places: log files, database, hadoop...

 Can we please purge stuff? :-)

 BR
 Sean

 Hi Sean,

 I sent a similar proposal to the internal list for preliminary feedback
 (see item 2 below)

 All, I wanted to hear your thoughts informally (before posting to the
 lists) on two ideas that have been floating around recently:

 1) add support for optional *sampling* in EventLogging via JSON schemas
 (given the sheer number of teams who have asked for it). See
 https://bugzilla.wikimedia.org/show_bug.cgi?id=65500

 Not to hijack the thread, but: to do this in the schema itself confuses the
structure of the data with the mechanics of its use. I think having a
couple of helpers in JavaScript and PHP for simple random sampling is
sufficient.
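
As a rough illustration of how small such a helper could be, here is a minimal sketch in Python (the JavaScript and PHP versions would be analogous). The function name `in_sample` and its `rate` parameter are hypothetical and are not part of EventLogging.

```python
import random


def in_sample(rate, rng=random):
    """Return True with probability `rate`, where 0.0 <= rate <= 1.0.

    Hypothetical helper: the caller logs the event only when this
    returns True. `rng` may be any object with a random() method,
    which makes the helper easy to seed in tests.
    """
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0.0 and 1.0")
    # random() returns a float in [0.0, 1.0), so rate=0.0 never
    # samples and rate=1.0 always does.
    return rng.random() < rate
```

The sampling decision stays in the calling code, for example `if in_sample(0.01): log_event(...)` with a hypothetical `log_event`, so the schema itself only describes the shape of the data.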

...

 2) introduce 90-day *pruning* by default for all logs (adding a
 dedicated schema element to override the default).

 Same problem. To illustrate: suppose we're two months into a data collection
job. The researcher forgot to modify the pruning policy, so it's set to the
default 90 days, whereas the researcher needs the data retained for 180 days.
At this point our options are:

1) Decline to help, even though there's a full month before the pruning
kicks in.
2) Somehow alter the schema revision without creating a new revision.
EventLogging assumes that schema revisions are immutable and it exploits
this property to provide guarantees about data validity and consistency, so
this is a nonstarter.
3) Create a new schema revision that declares a 180 day expiration and then
populate its table with a copy of each event logged under the previous
schema.

The motivation behind your proposal is (I think) a desire to have a unified
configuration interface for data collection jobs. This makes total sense
and it's worth pursuing. I just don't think we should stuff everything into
the schema. The schema is just that: a schema. It's a data model.

...
  This would push to the customers the responsibility of ensuring the right
 data is collected and retained.

 I understand 2) has already been partly implemented for the raw JSON logs
 (not yet for EL data stored in SQL). Obviously, we would need to audit
 existing logs to make sure that we don’t discard data that needs to be
 retained in a sanitized or aggregate form past 90 days.

 Note that – per our data retention guidelines [1] – not all EL data is
 expected to be automatically purged within 90 days (see the section on
 “Non-personal information associated with a user account”): many of these
 logs have a status similar to MediaWiki data that is retained in the DB but
 not fully exposed to labs.

...
  For this reason, I am proposing that we enable 90-day pruning by default
 for *new schemas*, with the ability to override the default.

Sounds good to me. I figure that the overrides would be specified as
configuration values for the script that does the actual pruning. We could
Puppetize that and document the process for adding exemptions.
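
To make that concrete, here is a minimal sketch in Python of how the pruning script might resolve its configuration. Everything in it is an assumption for illustration: the schema names, the `RETENTION_OVERRIDES` mapping, and the convention that `None` means "never prune" are hypothetical, and no such script exists yet.

```python
from datetime import datetime, timedelta

# Default retention, per the 90-day policy discussed in this thread.
DEFAULT_RETENTION_DAYS = 90

# Hypothetical exemptions, keyed by schema name. In practice these
# values would come from a Puppet-managed config file.
# None means "never prune this schema".
RETENTION_OVERRIDES = {
    "ExampleLongStudy": 180,
    "ExampleKeepForever": None,
}


def retention_days(schema):
    """Return the retention period in days for a schema, or None."""
    return RETENTION_OVERRIDES.get(schema, DEFAULT_RETENTION_DAYS)


def prune_cutoff(schema, now):
    """Return the datetime before which events may be deleted.

    Returns None when the schema is exempt from pruning.
    """
    days = retention_days(schema)
    if days is None:
        return None
    return now - timedelta(days=days)
```

With this layout, extending retention for a job that is already running means changing one configuration value; the schema revision itself is never touched.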

...
  *Existing schemas* would need to be audited on a case-by-case basis.

By whom? :) Surely not Sean! It'd be great to get this process going.
