On Tue, May 20, 2014 at 10:36 PM, Dario Taraborelli <
On May 20, 2014, at 10:09 PM, Sean Pringle
I'd like to hear from stakeholders about purging old data from the
eventlogging database. Yes, no, why [not], etc.
I understand from Ori that there is a 90-day retention policy, and that
purging has been discussed previously but not addressed for various
reasons. Certainly there are many timestamps older than 90 days still in
the db, and apparently largely untouched by queries?
Perhaps we're in a better position now to do this properly, what with data
now in multiple places: log files, database, Hadoop...
Can we please purge stuff? :-)
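For reference, a minimal sketch of what such a purge might look like, assuming EventLogging tables carry a `timestamp` column in MediaWiki's 14-digit format (the table name below is illustrative, not an actual schema):

```python
from datetime import datetime, timedelta

# Days of data to keep, per the retention policy under discussion.
RETENTION_DAYS = 90

def mw_cutoff(days=RETENTION_DAYS, now=None):
    """Return the purge cutoff as a MediaWiki-style 14-digit timestamp
    (YYYYMMDDHHMMSS), the format EventLogging tables use."""
    now = now or datetime.utcnow()
    return (now - timedelta(days=days)).strftime('%Y%m%d%H%M%S')

# Illustrative table name; real tables are named <Schema>_<revision>.
print("DELETE FROM `SomeSchema_123` WHERE timestamp < '%s';" % mw_cutoff())
```

In practice the DELETE would be batched (e.g. with LIMIT in a loop) to avoid long-running transactions on the replicas.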
I sent a similar proposal to the internal list for preliminary feedback
(see item 2 below).
All, I wanted to hear your thoughts informally (before posting to the
lists) on two ideas that have been floating around recently:
1) add support for optional *sampling* in EventLogging via JSON schemas
(given the sheer number of teams who have asked for it). See
Not to hijack the thread, but: doing this in the schema itself confuses the
structure of the data with the mechanics of its use. I think having a
2) introduce 90-day *pruning* by default for all logs (adding a
dedicated schema element to override the default).
Same problem. To illustrate: suppose we're two months into a data
collection job. The researcher carelessly forgot to modify the pruning
policy, so it's set to the default 90 days, whereas the researcher needs it
for 180. At this point our options are:
1) Decline to help, even though there's a full month before the pruning
takes effect.
2) Somehow alter the schema revision without creating a new revision.
EventLogging assumes that schema revisions are immutable and it exploits
this property to provide guarantees about data validity and consistency, so
this is a nonstarter.
3) Create a new schema revision that declares a 180-day expiration and then
populate its table with a copy of each event logged under the previous
revision.
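For concreteness, option 3 amounts to something like the following sketch. The schema name and revision numbers are hypothetical, and it assumes the new revision changes only retention metadata, so both tables share the same columns:

```python
# Sketch of option 3: copy events from the old revision's table into the
# new revision's table. Names are illustrative; EventLogging tables are
# conventionally named <Schema>_<revision>.
OLD_TABLE = 'SomeStudy_100'   # subject to the 90-day default
NEW_TABLE = 'SomeStudy_101'   # declares a 180-day expiration

# Only valid if the two revisions have identical columns, i.e. the new
# revision changed retention metadata and nothing else.
copy_sql = "INSERT INTO `%s` SELECT * FROM `%s`;" % (NEW_TABLE, OLD_TABLE)
print(copy_sql)
```

Which is exactly the clumsiness being objected to: a retention tweak forces a schema revision bump and a bulk table copy.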
The motivation behind your proposal is (I think) a desire to have a unified
configuration interface for data collection jobs. This makes total sense
and it's worth pursuing. I just don't think we should stuff everything into
the schema. The schema is just that: a schema. It's a data model.
This would push the responsibility of ensuring that the right data is
collected and retained onto the customers.
I understand 2) has already been partly implemented for the raw JSON logs
(not yet for EL data stored in SQL). Obviously, we would need to audit
existing logs to make sure that we don’t discard data that needs to be
retained in a sanitized or aggregate form past 90 days.
Note that – per our data retention guidelines – not all EL data is
expected to be automatically purged within 90 days (see the section on
“Non-personal information associated with a user account”): many of these
logs have a status similar to MediaWiki data that is retained in the DB but
not fully exposed to labs.
For this reason, I am proposing that we enable 90-day
pruning by default
for *new schemas*, with the ability to override the default.
Sounds good to me. I figure that the overrides would be specified as
configuration values for the script that does the actual pruning. We could
Puppetize that and document the process for adding exemptions.
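One hedged sketch of how those overrides might work, with an illustrative config mapping schema names to retention days and everything else falling back to the 90-day default (the schema names and dict are hypothetical; in practice the mapping would live in Puppet-managed configuration):

```python
from datetime import datetime, timedelta

# Hypothetical per-schema retention overrides; anything not listed
# falls back to the 90-day default. In practice this would be
# Puppet-managed configuration, not hard-coded.
DEFAULT_RETENTION_DAYS = 90
RETENTION_OVERRIDES = {
    'LongRunningStudy': 180,   # illustrative schema name
}

def purge_statement(table, now=None):
    """Build the DELETE for one EventLogging table, honoring overrides.

    Assumes tables are named <Schema>_<revision> and carry a
    `timestamp` column in MediaWiki's 14-digit format.
    """
    schema = table.rsplit('_', 1)[0]
    days = RETENTION_OVERRIDES.get(schema, DEFAULT_RETENTION_DAYS)
    now = now or datetime.utcnow()
    cutoff = (now - timedelta(days=days)).strftime('%Y%m%d%H%M%S')
    return "DELETE FROM `%s` WHERE timestamp < '%s';" % (table, cutoff)
```

Adding an exemption is then a one-line config change plus documentation, with no schema revision bump required.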
*Existing schemas* would need to be audited on a case-by-case basis.
By whom? :) Surely not Sean! It'd be great to get this process going.