On May 20, 2014, at 10:09 PM, Sean Pringle <springle@wikimedia.org> wrote:

Hi!

I'd like to hear from stakeholders about purging old data from the eventlogging database. Yes, no, why [not], etc.

I understand from Ori that there is a 90 day retention policy, and that purging has been discussed previously but not addressed for various reasons. Certainly there are many timestamps older than 90 days still in the db, and apparently largely untouched by queries?

Perhaps we're in a better position to do this properly now that the data lives in multiple places: log files, database, Hadoop...

Can we please purge stuff? :-)

BR
Sean

Hi Sean, 

I sent a similar proposal to the internal list for preliminary feedback (see item 2 below):

All, I wanted to hear your thoughts informally (before posting to the lists) on two ideas that have been floating around recently:

1) add support for optional sampling in EventLogging via JSON schemas (given the sheer number of teams who have asked for it). See https://bugzilla.wikimedia.org/show_bug.cgi?id=65500

2) introduce 90-day pruning by default for all logs (adding a dedicated schema element to override the default).

This would push to the customers the responsibility of ensuring the right data is collected and retained.

I understand 2) has already been partly implemented for the raw JSON logs (not yet for EL data stored in SQL). Obviously, we would need to audit existing logs to make sure that we don’t discard data that needs to be retained in a sanitized or aggregate form past 90 days.
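
To make idea 1) concrete, here is a minimal sketch of schema-driven sampling. It is an illustration only, not EventLogging's actual design: the "samplingRate" schema element, the in_sample() helper and the hash-based bucketing are all assumptions made for the example.

```python
import hashlib

def in_sample(token: str, schema: dict) -> bool:
    """Decide whether events for this token should be logged.

    "samplingRate" is a hypothetical schema element (a float in [0, 1]);
    schemas that omit it log everything.
    """
    rate = schema.get("samplingRate", 1.0)
    # Hash the token so the same token always lands in the same bucket.
    digest = hashlib.sha256(token.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # integer in [0, 2**64)
    # Compare against the scaled rate; a rate of 1.0 keeps every token.
    return bucket < rate * 2**64
```

Hashing a stable token rather than drawing a random number keeps the decision consistent across events from the same session, which may or may not be what a given team wants.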


Note that – per our data retention guidelines [1] – not all EL data is expected to be automatically purged within 90 days (see the section on “Non-personal information associated with a user account”): many of these logs have a status similar to MediaWiki data that is retained in the DB but not fully exposed to labs. For this reason, I am proposing that we enable 90-day pruning by default for new schemas, with the ability to override the default. Existing schemas would need to be audited on a case-by-case basis.
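
As a rough sketch of what "90-day pruning by default, with the ability to override" could look like, assuming a hypothetical "retentionDays" schema element (the name and the helpers below are invented for illustration and are not part of EventLogging):

```python
from datetime import datetime, timedelta, timezone

DEFAULT_RETENTION_DAYS = 90  # the proposed default

def retention_days(schema: dict):
    """Return the retention period in days, or None to keep data indefinitely.

    "retentionDays" is a hypothetical schema element: absent means the
    default applies, an integer overrides it, and None opts out of pruning.
    """
    return schema.get("retentionDays", DEFAULT_RETENTION_DAYS)

def is_expired(event_time: datetime, schema: dict, now: datetime) -> bool:
    """True if an event logged at event_time is past its retention period."""
    days = retention_days(schema)
    if days is None:
        return False  # schema explicitly opted out of pruning
    return now - event_time > timedelta(days=days)

now = datetime(2014, 5, 20, tzinfo=timezone.utc)
old_event = now - timedelta(days=120)
print(is_expired(old_event, {}, now))                      # default applies
print(is_expired(old_event, {"retentionDays": 365}, now))  # overridden
```

Under this sketch a new schema gets pruned without any action from its owner, while an audited existing schema would carry an explicit override.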

Dario


[1] https://meta.wikimedia.org/wiki/Data_retention_guidelines