Hello!
We are standardizing the way we sanitize and retain event data in Hive. For now, nothing will change for instrumentation data. What is changing is that we will apply the same sanitization process to all tables in the Hive event database, and then drop data older than 90 days from every table in that database.
For analytics/instrumentation event tables, nothing is changing. If you need to keep data longer than 90 days, you will need to add an entry to the event_sanitized_analytics allowlist https://gerrit.wikimedia.org/r/plugins/gitiles/analytics/refinery/+/refs/heads/master/static_data/sanitization/event_sanitized_analytics_allowlist.yaml (this was previously named eventlogging/whitelist.yaml) as described here https://wikitech.wikimedia.org/wiki/Analytics/Systems/Event_Sanitization#Allowlists .
For main / production event tables (e.g. mediawiki_revision_create), we are now copying this data into the event_sanitized database. Tables to be copied are listed in the event_sanitized_main allowlist https://gerrit.wikimedia.org/r/plugins/gitiles/analytics/refinery/+/refs/heads/master/static_data/sanitization/event_sanitized_main_allowlist.yaml .
We will soon begin applying the same purging policy to all tables in the event database. When that happens, main / production event tables in the event database will no longer have data older than 90 days. If you need to query data older than 90 days for these tables, you will find it in the event_sanitized database.
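If you have scripts that query these tables, here is a rough sketch of how you might pick the database based on the age of the data. It is only an illustration: the helper names are made up, the year/month/day partition columns are an assumption about the table layout, and the exact boundary at which purging happens may differ from a simple 90-day cutoff.

```python
from datetime import date, timedelta

RETENTION_DAYS = 90  # retention window stated in this announcement


def database_for(day: date, today: date) -> str:
    """Return the Hive database expected to hold events for `day`.

    Hypothetical helper: assumes data newer than the cutoff is in `event`
    and anything older is only available in `event_sanitized`.
    """
    cutoff = today - timedelta(days=RETENTION_DAYS)
    return "event" if day >= cutoff else "event_sanitized"


def build_query(table: str, day: date, today: date) -> str:
    """Build a count query for one day of data in the appropriate database.

    Assumes the table is partitioned by year, month and day columns.
    """
    db = database_for(day, today)
    return (
        f"SELECT COUNT(*) FROM {db}.{table} "
        f"WHERE year = {day.year} AND month = {day.month} AND day = {day.day}"
    )


if __name__ == "__main__":
    print(build_query("mediawiki_revision_create", date(2021, 1, 1), date.today()))
```

Keep in mind that event_sanitized only holds the tables and fields listed in the allowlists, so a query that works against event may need adjusting.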
With this change, sanitization and retention will work the same way for all event tables in Hive.
Docs have been updated; you can read more here:
- https://wikitech.wikimedia.org/wiki/Analytics/Systems/Event_Sanitization
- https://wikitech.wikimedia.org/wiki/Analytics/Systems/Event_Data_retention
-Andrew Otto
SRE, Data Engineering