purging old data from eventlogging db

List overview All Threads
Download

newer

older

Wikipedia featured on...

EventLogging postmortem, and...

Sean Pringle

21 May 2014 21 May '14

1:09 a.m.

Hi!

I'd like to hear from stakeholders about purging old data from the eventlogging database. Yes, no, why [not], etc.

I understand from Ori that there is a 90 day retention policy, and that purging has been discussed previously but not addressed for various reasons. Certainly there are many timestamps older than 90 days still in the db, and apparently largely untouched by queries?

Perhaps we're in a better position now to do this properly what with data now in multiple places: log files, database, hadoop...

Can we please purge stuff? :-)

BR Sean

-- DBA @ WMF

Attachments:

attachment.htm (text/html — 805 bytes)

Show replies by date

Dario Taraborelli

21 May 21 May

1:36 a.m.

On May 20, 2014, at 10:09 PM, Sean Pringle springle@wikimedia.org wrote:

...

Hi!

I'd like to hear from stakeholders about purging old data from the eventlogging database. Yes, no, why [not], etc.

I understand from Ori that there is a 90 day retention policy, and that purging has been discussed previously but not addressed for various reasons. Certainly there are many timestamps older than 90 days still in the db, and apparently largely untouched by queries?

Perhaps we're in a better position now to do this properly what with data now in multiple places: log files, database, hadoop...

Can we please purge stuff? :-)

BR Sean

Hi Sean,

I sent a similar proposal to the internal list for preliminary feedback (see item 2 below)

...

All, I wanted to hear your thoughts informally (before posting to the lists) on two ideas that have been floating around recently:

add support for optional sampling in EventLogging via JSON schemas (given the sheer number of teams who have asked for it). See https://bugzilla.wikimedia.org/show_bug.cgi?id=65500

introduce 90-day pruning by default for all logs, (adding a dedicated schema element to override the default).

This would push to the customers the responsibility of ensuring the right data is collected and retained.

I understand 2) has already been partly implemented for the raw JSON logs (not yet for EL data stored in SQL). Obviously, we would need to audit existing logs to make sure that we don’t discard data that needs to be retained in a sanitized or aggregate form past 90 days.

Note that – per our data retention guidelines [1] – not all EL data is expected to be automatically purged within 90 days (see the section on “Non-personal information associated with a user account”): many of these logs have a status similar to MediaWiki data that is retained in the DB but not fully exposed to labs. For this reason, I am proposing that we enable 90-day pruning by default for new schemas, with the ability to override the default. Existing schemas would need to be audited on a case by case basis.

Dario

[1] https://meta.wikimedia.org/wiki/Data_retention_guidelines

Ori Livneh

3:03 a.m.

On Tue, May 20, 2014 at 10:36 PM, Dario Taraborelli < dtaraborelli@wikimedia.org> wrote:

...

On May 20, 2014, at 10:09 PM, Sean Pringle springle@wikimedia.org wrote:

Hi!

I'd like to hear from stakeholders about purging old data from the eventlogging database. Yes, no, why [not], etc.

I understand from Ori that there is a 90 day retention policy, and that purging has been discussed previously but not addressed for various reasons. Certainly there are many timestamps older than 90 days still in the db, and apparently largely untouched by queries?

Perhaps we're in a better position now to do this properly what with data now in multiple places: log files, database, hadoop...

Can we please purge stuff? :-)

BR Sean

Hi Sean,

I sent a similar proposal to the internal list for preliminary feedback (see item 2 below)

All, I wanted to hear your thoughts informally (before posting to the lists) on two ideas that have been floating around recently:

add support for optional *sampling* in EventLogging via JSON schemas

(given the sheer number of teams who have asked for it). See https://bugzilla.wikimedia.org/show_bug.cgi?id=65500

Not to hijack the thread, but: to do this in the schema itself confuses

the structure of the data with the mechanics of its use. I think having a couple of helpers in JavaScript and PHP for simple random sampling is sufficient.

...

introduce 90-day *pruning* by default for all logs, (adding a

dedicated schema element to override the default).

Same problem. To illustrate: suppose we're two months into a data

collection job. The researcher carelessly forgot to modify the pruning policy, so it's set to the default 90 days, whereas the researcher needs it for 180. At this point our options are:

1) Decline to help, even though there's a full month before the pruning kicks in. 2) Somehow alter the schema revision without creating a new revision. EventLogging assumes that schema revisions are immutable and it exploits this property to provide guarantees about data validity and consistency, so this is a nonstarter. 3) Create a new schema revision that declares a 180 day expiration and then populate its table with a copy of each event logged under the previous schema.

The motivation behind your proposal is (I think) a desire to have a unified configuration interface for data collection jobs. This makes total sense and it's worth pursuing. I just don't think we should stuff everything into the schema. The schema is just that: a schema. It's a data model.

...

This would push to the customers the responsibility of ensuring the right data is collected and retained.

I understand 2) has already been partly implemented for the raw JSON logs (not yet for EL data stored in SQL). Obviously, we would need to audit existing logs to make sure that we don’t discard data that needs to be retained in a sanitized or aggregate form past 90 days.

Note that – per our data retention guidelines [1] – not all EL data is expected to be automatically purged within 90 days (see the section on “Non-personal information associated with a user account”): many of these logs have a status similar to MediaWiki data that is retained in the DB but not fully exposed to labs.

...

For this reason, I am proposing that we enable 90-day pruning by default for *new schemas*, with the ability to override the default.

Sounds good to me. I figure that the overrides would be specified as configuration values for the script that does the actual pruning. We could Puppetize that and document the process for adding exemptions.

...

*Existing schemas* would need to be audited on a case by case basis.

By whom? :) Surely not Sean! It'd be great to get this process going.

Nuria Ruiz

5:27 a.m.

...

Not to hijack the thread, but: to do this in the schema itself confuses the structure of the data with the mechanics of its use. I think having a couple of helpers in JavaScript and PHP for simple random sampling is sufficient.

Much agree with ori here. We would be bloating schema with properties that have nothing to do with data definition.

...

Note that – per our data retention guidelines [1] – not all EL data is expected to be automatically purged within 90 days >(see the section on “Non-personal information associated with a user account”)

I certainly think we should keep performance data (like navigation timing) for longer than 90 days removing pageId and userId if needed.

On Wed, May 21, 2014 at 9:03 AM, Ori Livneh ori@wikimedia.org wrote:

...

On Tue, May 20, 2014 at 10:36 PM, Dario Taraborelli dtaraborelli@wikimedia.org wrote:

...
On May 20, 2014, at 10:09 PM, Sean Pringle springle@wikimedia.org wrote:

Hi!

I'd like to hear from stakeholders about purging old data from the eventlogging database. Yes, no, why [not], etc.

I understand from Ori that there is a 90 day retention policy, and that purging has been discussed previously but not addressed for various reasons. Certainly there are many timestamps older than 90 days still in the db, and apparently largely untouched by queries?

Perhaps we're in a better position now to do this properly what with data now in multiple places: log files, database, hadoop...

Can we please purge stuff? :-)

BR Sean

Hi Sean,

I sent a similar proposal to the internal list for preliminary feedback (see item 2 below)

All, I wanted to hear your thoughts informally (before posting to the lists) on two ideas that have been floating around recently:

add support for optional sampling in EventLogging via JSON schemas

(given the sheer number of teams who have asked for it). See https://bugzilla.wikimedia.org/show_bug.cgi?id=65500

Not to hijack the thread, but: to do this in the schema itself confuses the structure of the data with the mechanics of its use. I think having a couple of helpers in JavaScript and PHP for simple random sampling is sufficient.

...

introduce 90-day pruning by default for all logs, (adding a dedicated

schema element to override the default).

Same problem. To illustrate: suppose we're two months into a data collection job. The researcher carelessly forgot to modify the pruning policy, so it's set to the default 90 days, whereas the researcher needs it for 180. At this point our options are:

Decline to help, even though there's a full month before the pruning

kicks in. 2) Somehow alter the schema revision without creating a new revision. EventLogging assumes that schema revisions are immutable and it exploits this property to provide guarantees about data validity and consistency, so this is a nonstarter. 3) Create a new schema revision that declares a 180 day expiration and then populate its table with a copy of each event logged under the previous schema.

The motivation behind your proposal is (I think) a desire to have a unified configuration interface for data collection jobs. This makes total sense and it's worth pursuing. I just don't think we should stuff everything into the schema. The schema is just that: a schema. It's a data model.

...
This would push to the customers the responsibility of ensuring the right data is collected and retained.

I understand 2) has already been partly implemented for the raw JSON logs (not yet for EL data stored in SQL). Obviously, we would need to audit existing logs to make sure that we don’t discard data that needs to be retained in a sanitized or aggregate form past 90 days.

Note that – per our data retention guidelines [1] – not all EL data is expected to be automatically purged within 90 days (see the section on “Non-personal information associated with a user account”): many of these logs have a status similar to MediaWiki data that is retained in the DB but not fully exposed to labs.

...
For this reason, I am proposing that we enable 90-day pruning by default for new schemas, with the ability to override the default.

Sounds good to me. I figure that the overrides would be specified as configuration values for the script that does the actual pruning. We could Puppetize that and document the process for adding exemptions.

...
Existing schemas would need to be audited on a case by case basis.

By whom? :) Surely not Sean! It'd be great to get this process going.

Analytics mailing list Analytics@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/analytics

Dario Taraborelli

10:26 a.m.

...

The motivation behind your proposal is (I think) a desire to have a unified configuration interface for data collection jobs. This makes total sense and it's worth pursuing. I just don't think we should stuff everything into the schema. The schema is just that: a schema. It's a data model.

...

Much agree with ori here. We would be bloating schema with properties that have nothing to do with data definition.

agree with both of you, these are data collection settings that do not necessarily belong in the schema itself if its job is to represent the data model.

As you know, we don’t have a solution for representing schema metadata (other than the dirty hack of schema talk pages) or data collection options. As a customer, I would value the ability to specify schema ownership (who should be contacted if something goes wrong), sampling rates (should the data be collected sampled or unsampled), retention and privacy options (should the data be retained indefinitely? should the whole log be pruned after the retention window? are there fields that include PII that should be stripped?) as well as monitoring where a specific <schema, rev_id> is deployed.

Dario

Sean Pringle

27 May 27 May

10:49 p.m.

On Wed, May 21, 2014 at 5:03 PM, Ori Livneh ori@wikimedia.org wrote:

...

On Tue, May 20, 2014 at 10:36 PM, Dario Taraborelli < dtaraborelli@wikimedia.org> wrote:

...
On May 20, 2014, at 10:09 PM, Sean Pringle springle@wikimedia.org wrote:

...
*Existing schemas* would need to be audited on a case by case basis.

By whom? :) Surely not Sean! It'd be great to get this process going.

We could start archiving data older than 90 days, see who complains about what, then put it back ;-)

Dario Taraborelli

11:17 p.m.

On May 27, 2014, at 7:49 PM, Sean Pringle springle@wikimedia.org wrote:

...

On Wed, May 21, 2014 at 5:03 PM, Ori Livneh ori@wikimedia.org wrote:

On Tue, May 20, 2014 at 10:36 PM, Dario Taraborelli dtaraborelli@wikimedia.org wrote: On May 20, 2014, at 10:09 PM, Sean Pringle springle@wikimedia.org wrote:

Existing schemas would need to be audited on a case by case basis.

By whom? :) Surely not Sean! It'd be great to get this process going.

We could start archiving data older than 90 days, see who complains about what, then put it back ;-)

Sorry, I seem to have missed Ori’s last question. I promise that if you prune logs like ServerSideAccountCreation you’ll get a lot of attention ;)

I created a few days ago a card to keep track of this [1]. Kevin and I will take the lead on this and connect to the relevant log owners.

Dario

[1] https://trello.com/c/F0DsiSXn/305-audit-historical-el-data-for-retention

Nuria

28 May 28 May

2:56 a.m.

Second Dario for NavigationTiming data. Before archiving it I would like us to have a project for processing it. Also, graphs directly query the EL data store in many instances. Removing the data would mean we will only be showing 90 days of data on dashboards, that will send many complaints our way.

On May 28, 2014, at 5:17 AM, Dario Taraborelli dtaraborelli@wikimedia.org wrote:

...

On May 27, 2014, at 7:49 PM, Sean Pringle springle@wikimedia.org wrote:

...
On Wed, May 21, 2014 at 5:03 PM, Ori Livneh ori@wikimedia.org wrote:

...
On Tue, May 20, 2014 at 10:36 PM, Dario Taraborelli dtaraborelli@wikimedia.org wrote:

...
On May 20, 2014, at 10:09 PM, Sean Pringle springle@wikimedia.org wrote:

...
Existing schemas would need to be audited on a case by case basis.

By whom? :) Surely not Sean! It'd be great to get this process going.

We could start archiving data older than 90 days, see who complains about what, then put it back ;-)

Sorry, I seem to have missed Ori’s last question. I promise that if you prune logs like ServerSideAccountCreation you’ll get a lot of attention ;)

I created a few days ago a card to keep track of this [1]. Kevin and I will take the lead on this and connect to the relevant log owners.

Dario

[1] https://trello.com/c/F0DsiSXn/305-audit-historical-el-data-for-retention

Analytics mailing list Analytics@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/analytics

Dan Andreescu

1:50 p.m.

I just announced this potential change in Scrum of Scrums and the Mobile team said they also would like to keep old data, but not for all of their schemas. They're cleaning up their graphs and we should check with them when we start deleting.

On Wed, May 28, 2014 at 2:56 AM, Nuria nuria@wikimedia.org wrote:

...

Second Dario for NavigationTiming data. Before archiving it I would like us to have a project for processing it. Also, graphs directly query the EL data store in many instances. Removing the data would mean we will only be showing 90 days of data on dashboards, that will send many complaints our way.

On May 28, 2014, at 5:17 AM, Dario Taraborelli dtaraborelli@wikimedia.org wrote:

On May 27, 2014, at 7:49 PM, Sean Pringle springle@wikimedia.org wrote:

On Wed, May 21, 2014 at 5:03 PM, Ori Livneh ori@wikimedia.org wrote:

...
On Tue, May 20, 2014 at 10:36 PM, Dario Taraborelli < dtaraborelli@wikimedia.org> wrote:

...
On May 20, 2014, at 10:09 PM, Sean Pringle springle@wikimedia.org wrote:

...
*Existing schemas* would need to be audited on a case by case basis.

By whom? :) Surely not Sean! It'd be great to get this process going.

We could start archiving data older than 90 days, see who complains about what, then put it back ;-)

Sorry, I seem to have missed Ori’s last question. I promise that if you prune logs like ServerSideAccountCreation you’ll get a lot of attention ;)

I created a few days ago a card to keep track of this [1]. Kevin and I will take the lead on this and connect to the relevant log owners.

Dario

[1] https://trello.com/c/F0DsiSXn/305-audit-historical-el-data-for-retention

Analytics mailing list Analytics@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/analytics

Analytics mailing list Analytics@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/analytics

Steven Walling

29 May 29 May

2:26 a.m.

On Wed, May 28, 2014 at 10:50 AM, Dan Andreescu dandreescu@wikimedia.org wrote:

...

I just announced this potential change in Scrum of Scrums and the Mobile team said they also would like to keep old data, but not for all of their schemas. They're cleaning up their graphs and we should check with them when we start deleting.

Following up on this from the Growth perspective...

My main question is what the rationale is. Is it to improve query performance on analytics dbs?

I do know there are many older schemas for Growth-related experiments that are only really useful for historical analysis, which is kind of hard to reconstruct anyway. If there are sound technical reasons to chuck stuff from the relational dbs and retain it only in the raw JSON logs, then I'm potentially okay with helping figure out a list of schemas to retain and schemas to purge. Aaron, thoughts?

-- Steven Walling, Product Manager https://wikimediafoundation.org/

Aaron Halfaker

9:53 a.m.

+1 to Dario's mention of the many schemas that just capture production DB stuff in a better way.

Re. growth: Old growth experiment schemas continue to be a great resource for checking old work and sometimes even new hypotheses. When Dario and Kevin get around to us, I'll have a complete list of schemas that should not be purged.

Re. storage parameters in the Schema, I agree with Ori, but I'd still like to have them on the wiki somehow. If we were a bunch of Wikipedia editors, I'd suggest making a template for the talk page of a schema that captures this metadata. Given that a template would probably not be best and we'd probably like to stick to JSON, maybe a subpage would be in order.

E.g.

- Schema:Foo == data type JSON - Schema:Foo/restrictions == storage restrictions JSON (sampling, pruning, indexing, etc.) - Schema_talk:Foo == Discussion of Schema:Foo

Such a pattern would allow for changes to storage restrictions without changing the rev_id of the schema page (data type).

-Aaron

On Thu, May 29, 2014 at 1:26 AM, Steven Walling swalling@wikimedia.org wrote:

...

On Wed, May 28, 2014 at 10:50 AM, Dan Andreescu dandreescu@wikimedia.org wrote:

...
I just announced this potential change in Scrum of Scrums and the Mobile team said they also would like to keep old data, but not for all of their schemas. They're cleaning up their graphs and we should check with them when we start deleting.

Following up on this from the Growth perspective...

My main question is what the rationale is. Is it to improve query performance on analytics dbs?

I do know there are many older schemas for Growth-related experiments that are only really useful for historical analysis, which is kind of hard to reconstruct anyway. If there are sound technical reasons to chuck stuff from the relational dbs and retain it only in the raw JSON logs, then I'm potentially okay with helping figure out a list of schemas to retain and schemas to purge. Aaron, thoughts?

-- Steven Walling, Product Manager https://wikimediafoundation.org/

Analytics mailing list Analytics@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/analytics

Ori Livneh

30 May 30 May

1:28 a.m.

On Wed, May 28, 2014 at 11:26 PM, Steven Walling swalling@wikimedia.org wrote:

...

My main question is what the rationale is. Is it to improve query performance on analytics dbs?

I imagine it will help, but it's probably not the primary reason. I imagine Sean would like to have the database in a state of equilibrium such that there are no looming dangers, and no reason in principle why things couldn't just keep running. At the moment the clip of incoming events is prone to sharp fluctuations and there is no protocol in place for handling exhausted server capacity. Sean is our only(!) DBA and if the database server goes out, he gets paged, and it's on him to sort it out. We need to do better by him, IMO.

Sean Pringle

2:03 a.m.

On Fri, May 30, 2014 at 3:28 PM, Ori Livneh ori@wikimedia.org wrote:

...

On Wed, May 28, 2014 at 11:26 PM, Steven Walling swalling@wikimedia.org wrote:

...
My main question is what the rationale is. Is it to improve query performance on analytics dbs?

I imagine it will help, but it's probably not the primary reason. I imagine Sean would like to have the database in a state of equilibrium such that there are no looming dangers, and no reason in principle why things couldn't just keep running. At the moment the clip of incoming events is prone to sharp fluctuations and there is no protocol in place for handling exhausted server capacity.

Correct.

It's not really about performance since the dataset will be larger than $memory regardless.

Of course, if you guys decide that specific data needs to stay around for ever, that's fine; it helps with capacity planning and we just bite the bullet and ensure sufficient storage space is available. Having a default purge-after-X-months policy for new tables would be the baseline.

Nuria

2:33 a.m.

I see, I thought concern was privacy rather than capacity. In that case we should put in our backlog an item to short out schemas and find the ones whose data can be deleted. I will file an item to this extent.

In the future we hopefully have this metadata about the schema available somewhere.

On May 30, 2014, at 8:03 AM, Sean Pringle springle@wikimedia.org wrote:

...

On Fri, May 30, 2014 at 3:28 PM, Ori Livneh ori@wikimedia.org wrote:

...
On Wed, May 28, 2014 at 11:26 PM, Steven Walling swalling@wikimedia.org wrote:

...
My main question is what the rationale is. Is it to improve query performance on analytics dbs?

I imagine it will help, but it's probably not the primary reason. I imagine Sean would like to have the database in a state of equilibrium such that there are no looming dangers, and no reason in principle why things couldn't just keep running. At the moment the clip of incoming events is prone to sharp fluctuations and there is no protocol in place for handling exhausted server capacity.

Correct.

It's not really about performance since the dataset will be larger than $memory regardless.

Of course, if you guys decide that specific data needs to stay around for ever, that's fine; it helps with capacity planning and we just bite the bullet and ensure sufficient storage space is available. Having a default purge-after-X-months policy for new tables would be the baseline. _______________________________________________ Analytics mailing list Analytics@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/analytics

Aaron Halfaker

11:03 a.m.

Nuria, I believe that Dario already did that[1].

1. https://trello.com/c/F0DsiSXn/305-audit-historical-el-data-for-retention

On Fri, May 30, 2014 at 1:33 AM, Nuria nuria@wikimedia.org wrote:

...

I see, I thought concern was privacy rather than capacity. In that case we should put in our backlog an item to short out schemas and find the ones whose data can be deleted. I will file an item to this extent.

In the future we hopefully have this metadata about the schema available somewhere.

On May 30, 2014, at 8:03 AM, Sean Pringle springle@wikimedia.org wrote:

On Fri, May 30, 2014 at 3:28 PM, Ori Livneh ori@wikimedia.org wrote:

...
On Wed, May 28, 2014 at 11:26 PM, Steven Walling swalling@wikimedia.org wrote:

...
My main question is what the rationale is. Is it to improve query performance on analytics dbs?

I imagine it will help, but it's probably not the primary reason. I imagine Sean would like to have the database in a state of equilibrium such that there are no looming dangers, and no reason in principle why things couldn't just keep running. At the moment the clip of incoming events is prone to sharp fluctuations and there is no protocol in place for handling exhausted server capacity.

Correct.

It's not really about performance since the dataset will be larger than $memory regardless.

Of course, if you guys decide that specific data needs to stay around for ever, that's fine; it helps with capacity planning and we just bite the bullet and ensure sufficient storage space is available. Having a default purge-after-X-months policy for new tables would be the baseline.

Analytics mailing list Analytics@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/analytics

Analytics mailing list Analytics@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/analytics

Steven Walling

1:39 p.m.

On Thu, May 29, 2014 at 11:03 PM, Sean Pringle springle@wikimedia.org wrote:

...

On Fri, May 30, 2014 at 3:28 PM, Ori Livneh ori@wikimedia.org wrote:

...
On Wed, May 28, 2014 at 11:26 PM, Steven Walling swalling@wikimedia.org wrote:

...
My main question is what the rationale is. Is it to improve query performance on analytics dbs?

I imagine it will help, but it's probably not the primary reason. I imagine Sean would like to have the database in a state of equilibrium such that there are no looming dangers, and no reason in principle why things couldn't just keep running. At the moment the clip of incoming events is prone to sharp fluctuations and there is no protocol in place for handling exhausted server capacity.

Correct.

It's not really about performance since the dataset will be larger than $memory regardless.

Of course, if you guys decide that specific data needs to stay around for ever, that's fine; it helps with capacity planning and we just bite the bullet and ensure sufficient storage space is available. Having a default purge-after-X-months policy for new tables would be the baseline.

Thanks for the explanation guys. This makes perfect sense to me. I'd much rather have old data be something we have to dig a little harder for, than worry if current schemas are going to be accessible or not.

-- Steven Walling, Product Manager https://wikimediafoundation.org/

3874

Age (days ago)

3883

Last active (days ago)

analytics@lists.wikimedia.org

15 comments

8 participants

tags (0)

participants (8)

Aaron Halfaker
Dan Andreescu
Dario Taraborelli
Nuria
Nuria Ruiz
Ori Livneh
Sean Pringle
Steven Walling