Not to hijack the thread, but: to do this in the schema itself confuses
the structure of the data with the mechanics of its use. I think having a
couple of helper functions for simple random sampling is sufficient.
Much agreed with Ori here. We would be bloating the schema with
properties that have nothing to do with the data definition.
Note that – per our data retention guidelines – not
all EL data is expected to be automatically purged within 90 days (see the section on
“Non-personal information associated with a user account”).
I certainly think we
should keep performance data (like navigation
timing) for longer than 90 days, removing pageId and userId if needed.
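To illustrate what that could look like in practice, here is a minimal sketch of the sanitization step, run against an in-memory SQLite database; the table name and schema are hypothetical, not the actual EL tables:

```python
import sqlite3

def sanitize_old_rows(conn, table, cutoff_ts):
    """NULL out pageId and userId for rows older than cutoff_ts,
    keeping the performance metrics themselves."""
    conn.execute(
        f"UPDATE {table} SET pageId = NULL, userId = NULL WHERE timestamp < ?",
        (cutoff_ts,),
    )
    conn.commit()

# Demo with a made-up navigation-timing table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE NavTiming (timestamp TEXT, pageId INT, userId INT, loadMs INT)")
conn.execute("INSERT INTO NavTiming VALUES ('20140101000000', 1, 42, 350)")
conn.execute("INSERT INTO NavTiming VALUES ('20140520000000', 2, 43, 410)")
sanitize_old_rows(conn, "NavTiming", "20140301000000")
rows = conn.execute("SELECT pageId, userId, loadMs FROM NavTiming ORDER BY timestamp").fetchall()
# The old row keeps its timing metric but loses its identifiers;
# the recent row is untouched.
```

The point being that retention past 90 days and removal of identifiers are independent operations on the stored rows, neither of which needs to live in the schema.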
On Wed, May 21, 2014 at 9:03 AM, Ori Livneh <ori(a)wikimedia.org> wrote:
On Tue, May 20, 2014 at 10:36 PM, Dario Taraborelli wrote:
On May 20, 2014, at 10:09 PM, Sean Pringle <springle(a)wikimedia.org> wrote:
I'd like to hear from stakeholders about purging old data from the
eventlogging database. Yes, no, why [not], etc.
I understand from Ori that there is a 90 day retention policy, and that
purging has been discussed previously but not addressed for various reasons.
Certainly there are many timestamps older than 90 days still in the db, and
apparently largely untouched by queries?
Perhaps we're in a better position now to do this properly, what with the data
now in multiple places: log files, database, Hadoop...
Can we please purge stuff? :-)
I sent a similar proposal to the internal list for preliminary feedback
(see item 2 below).
All, I wanted to hear your thoughts informally (before posting to the
lists) on two ideas that have been floating around recently:
1) add support for optional sampling in EventLogging via JSON schemas
(given the sheer number of teams who have asked for it). See
Not to hijack the thread, but: to do this in the schema itself confuses the
structure of the data with the mechanics of its use. I think having a couple
of helper functions for simple random sampling is sufficient.
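A helper along those lines could be as small as the sketch below; `log_event` and its signature are made up for illustration and are not the real EventLogging client API:

```python
import random

def in_sample(rate):
    """Simple random sampling: return True with probability `rate` (0.0-1.0).
    The sampling decision lives in instrumentation code, not in the schema."""
    return random.random() < rate

def log_event(schema, event, sample_rate=1.0):
    """Hypothetical wrapper around the real client: drop events that fall
    outside the sample before they are ever sent."""
    if not in_sample(sample_rate):
        return False
    # ... hand the event off to the actual EventLogging client here ...
    return True
```

A team that wants 10% sampling calls `log_event(schema, event, sample_rate=0.1)`; the schema itself stays a pure data definition.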
2) introduce 90-day pruning by default for all logs (adding a dedicated
schema element to override the default).
Same problem. To illustrate: suppose we're two months into a data collection
job. The researcher carelessly forgot to modify the pruning policy, so it's
set to the default 90 days, whereas they actually need 180. At this
point our options are:
1) Decline to help, even though there's a full month before the pruning
would take effect.
2) Somehow alter the schema revision without creating a new revision.
EventLogging assumes that schema revisions are immutable and it exploits
this property to provide guarantees about data validity and consistency, so
this is a nonstarter.
3) Create a new schema revision that declares a 180-day expiration and then
populate its table with a copy of each event logged under the previous
revision.
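For what it's worth, option (3) amounts to a straight table-to-table copy, since each schema revision gets its own table. A rough sketch with hypothetical table names, using an in-memory SQLite database as a stand-in for the real store:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical per-revision tables: one table per (schema, revision) pair.
conn.execute("CREATE TABLE MySchema_100 (timestamp TEXT, value INT)")
conn.execute("CREATE TABLE MySchema_101 (timestamp TEXT, value INT)")
conn.executemany("INSERT INTO MySchema_100 VALUES (?, ?)",
                 [("20140401000000", 1), ("20140501000000", 2)])

# Backfill the new revision's table from the old one, so that a policy
# attached to revision 101 would cover the existing events too.
conn.execute("INSERT INTO MySchema_101 SELECT * FROM MySchema_100")
count = conn.execute("SELECT COUNT(*) FROM MySchema_101").fetchone()[0]
```

Workable, but clumsy: every policy change forces a revision bump and a bulk copy, which is exactly the cost of coupling retention to the schema.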
The motivation behind your proposal is (I think) a desire to have a unified
configuration interface for data collection jobs. This makes total sense and
it's worth pursuing. I just don't think we should stuff everything into the
schema. The schema is just that: a schema. It's a data model.
This would push onto the customers the responsibility of ensuring that the
right data is collected and retained.
I understand 2) has already been partly implemented for the raw JSON logs
(not yet for EL data stored in SQL). Obviously, we would need to audit
existing logs to make sure that we don’t discard data that needs to be
retained in a sanitized or aggregate form past 90 days.
Note that – per our data retention guidelines – not all EL data is
expected to be automatically purged within 90 days (see the section on
“Non-personal information associated with a user account”): many of these
logs have a status similar to MediaWiki data that is retained in the DB but
not fully exposed to labs.
For this reason, I am proposing that we enable 90-day pruning by default
for new schemas, with the ability to override the default.
Sounds good to me. I figure that the overrides would be specified as
configuration values for the script that does the actual pruning. We could
Puppetize that and document the process for adding exemptions.
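A sketch of what that split could look like: the default lives in the pruning script, and exemptions are plain configuration (e.g. a puppetized file), not schema properties. All table names are hypothetical, and SQLite stands in for the real database:

```python
import sqlite3
from datetime import datetime, timedelta

DEFAULT_RETENTION_DAYS = 90
# Per-schema-table exemptions, maintained as configuration.
RETENTION_OVERRIDES = {"LongRunningStudy_777": 180}

def prune(conn, table, now):
    """Delete rows older than the table's retention window; return the count."""
    days = RETENTION_OVERRIDES.get(table, DEFAULT_RETENTION_DAYS)
    cutoff = (now - timedelta(days=days)).strftime("%Y%m%d%H%M%S")
    cur = conn.execute(f"DELETE FROM {table} WHERE timestamp < ?", (cutoff,))
    conn.commit()
    return cur.rowcount

# Demo: two tables, one covered by the default, one exempted.
conn = sqlite3.connect(":memory:")
for t in ("EdgeCounts_123", "LongRunningStudy_777"):
    conn.execute(f"CREATE TABLE {t} (timestamp TEXT, value INT)")
    conn.execute(f"INSERT INTO {t} VALUES ('20140101000000', 1)")
    conn.execute(f"INSERT INTO {t} VALUES ('20140301000000', 2)")

now = datetime(2014, 5, 21)
pruned_default = prune(conn, "EdgeCounts_123", now)         # 90-day default
pruned_override = prune(conn, "LongRunningStudy_777", now)  # 180-day override
```

In the 180-day scenario above, fixing a forgotten override then means editing one config line and deploying, with no schema revision churn.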
Existing schemas would need to be audited on a case by case basis.
By whom? :) Surely not Sean! It'd be great to get this process going.