Hello!
We are standardizing the way we sanitize and retain event data in Hive. Nothing will change for instrumentation data. What is changing is that we will apply the same sanitization process to all tables in the Hive event database, and then drop all data older than 90 days from every table in that database.
For analytics/instrumentation event tables, nothing is changing. If you need to keep data longer than 90 days, you will need to add an entry to the
event_sanitized_analytics allowlist (this was previously named eventlogging/whitelist.yaml) as described
here.
For main / production event tables (e.g. mediawiki_revision_create), we are now copying this data into the event_sanitized database. Tables to be copied are listed in the
event_sanitized_main allowlist.
We will soon begin applying the same purging policy to all tables in the event database. When that happens, main / production event tables in the event database will no longer have data older than 90 days. If you need to query data older than 90 days for these tables, you will find it in the event_sanitized database.
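To make the lookup rule above concrete, here is a minimal Python sketch of how a script might pick the database for a given day once purging is in effect. The function names are hypothetical, and it assumes that a table keeps the same name in event_sanitized as in event and that the cutoff is exactly 90 days; the actual purge timing may differ by a day or so. Only tables listed in the allowlist will be present in event_sanitized.

```python
from datetime import date, timedelta

# Retention window stated in this announcement.
RETENTION_DAYS = 90


def database_for(event_day: date, today: date) -> str:
    """Return the Hive database expected to hold data for event_day.

    Hypothetical helper: data within the retention window is read from
    `event`; anything older is looked up in `event_sanitized`.
    """
    age_days = (today - event_day).days
    return "event" if age_days <= RETENTION_DAYS else "event_sanitized"


def qualified_table(table: str, event_day: date, today: date) -> str:
    """Build a database-qualified table name (assumes identical table names)."""
    return f"{database_for(event_day, today)}.{table}"


if __name__ == "__main__":
    today = date(2021, 6, 1)  # arbitrary example date
    recent = today - timedelta(days=30)
    old = today - timedelta(days=200)
    print(qualified_table("mediawiki_revision_create", recent, today))
    print(qualified_table("mediawiki_revision_create", old, today))
```

With the example dates above, the first lookup resolves to event.mediawiki_revision_create and the second to event_sanitized.mediawiki_revision_create.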
With this change, sanitization and retention work the same way for all event tables in Hive.
Docs have been updated; you can read more here:
-Andrew Otto
SRE, Data Engineering