On May 20, 2014, at 10:09 PM, Sean Pringle <springle(a)wikimedia.org> wrote:
I'd like to hear from stakeholders about purging old data from the eventlogging
database. Yes, no, why [not], etc.
I understand from Ori that there is a 90 day retention policy, and that purging has been
discussed previously but not addressed for various reasons. Certainly there are many
timestamps older than 90 days still in the db, and apparently largely untouched by
Perhaps we're in a better position now to do this properly what with data now in
multiple places: log files, database, hadoop...
Can we please purge stuff? :-)
I sent a similar proposal to the internal list for preliminary feedback (see item 2
below):
All, I wanted to hear your thoughts informally (before
posting to the lists) on two ideas that have been floating around recently:
1) add support for optional sampling in EventLogging via JSON schemas (given the sheer
number of teams who have asked for it). See
2) introduce 90-day pruning by default for all logs (adding a dedicated schema element
to override the default).
This would push to the customers the responsibility of ensuring the right data is
collected and retained.
I understand 2) has already been partly implemented for the raw JSON logs (not yet for EL
data stored in SQL). Obviously, we would need to audit existing logs to make sure that we
don’t discard data that needs to be retained in a sanitized or aggregate form past 90
days.
Note that – per our data retention guidelines – not all EL data is expected to be
automatically purged within 90 days (see the section on “Non-personal information
associated with a user account”): many of these logs have a status similar to MediaWiki
data that is retained in the DB but not fully exposed to labs. For this reason, I am
proposing that we enable 90-day pruning by default for new schemas, with the ability to
override the default. Existing schemas would need to be audited on a case-by-case basis.
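
To make the "90 days by default, with an override" idea concrete, here is a rough sketch of how a purge job might decide whether an event is eligible for deletion. This is only an illustration under stated assumptions, not how EventLogging works today: the schema element name `retentionDays`, the convention that a null value means "keep indefinitely", and the function names are all hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Default retention window from the proposal above.
DEFAULT_RETENTION_DAYS = 90


def retention_days(schema: dict) -> Optional[int]:
    """Return the retention window in days, or None to keep data indefinitely.

    "retentionDays" is a hypothetical schema element used for illustration;
    a schema that omits it falls back to the 90-day default.
    """
    if "retentionDays" not in schema:
        return DEFAULT_RETENTION_DAYS
    return schema["retentionDays"]


def should_purge(event_time: datetime, schema: dict, now: datetime) -> bool:
    """True if an event logged at event_time is older than its schema allows."""
    days = retention_days(schema)
    if days is None:
        # Explicit override: this schema is exempt from automatic purging.
        return False
    return event_time < now - timedelta(days=days)


now = datetime(2014, 5, 20, tzinfo=timezone.utc)
old_event = now - timedelta(days=120)

# No override: the 90-day default applies, so a 120-day-old event is purged.
purge_default = should_purge(old_event, {}, now)
# Override to one year: the same event is kept.
purge_one_year = should_purge(old_event, {"retentionDays": 365}, now)
# Override to "keep indefinitely": the event is never purged automatically.
purge_exempt = should_purge(old_event, {"retentionDays": None}, now)
```

Under this sketch, existing schemas that have not yet been audited could be given an explicit exemption first, so that nothing is discarded until someone has reviewed it.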