Hi all,
Growth team needs some data removed from both the raw logs and analytics slaves. Sean said he can help with the EventLogging db maintenance, but is unfamiliar with the logs on Vanadium.
This is to purge data from a recent set of experiments that involved setting a token for anonymous editors. Now that we've got our results and aggregated any non-private data we need in the future, we can safely remove any data stored in the associated schemas. This doesn't need to be selective based on schema ids or dates; we can probably just wholesale remove the associated schemas listed at https://meta.wikimedia.org/wiki/Research:Asking_anonymous_editors_to_registe...
Sean suggested Christian or Nuria might be best equipped to help here. If Aaron and I provide a list of the schemas, is this possible? Ideally, we'd like to delete these by 8/04, so apologies in advance for such a tight turnaround time.
Hi Steven,
On Tue, Jul 29, 2014 at 11:56:31AM -0700, Steven Walling wrote:
Growth team needs some data removed from [...] the raw logs [...]
Thanks for caring to clean up no longer needed data. It's greatly appreciated.
However, we typically do not scrub or clean the raw logs [1].
People are using those files for debugging and pushed back when we asked about whether we should clean them up.
Those raw files are only available to a limited set of people, so it is typically less of an issue.
Is it ok to just remove the data from databases (Thanks to Sean!) and let the data sit on the raw logs, or is there a hard requirement to scrub the raw logs clean too?
Have fun, Christian
[1] See 'Raw client and server side log files' item in http://lists.wikimedia.org/pipermail/analytics/2014-June/002256.html
I'd much rather we scrub the data completely as that's the plan that we put forward. In general, scrubbing the data from the DB, but not the raw logs serves no useful purpose.
-Aaron
On Wed, Jul 30, 2014 at 3:55 AM, Christian Aistleitner < christian@quelltextlich.at> wrote:
[...]
-- ---- quelltextlich e.U. ---- \ ---- Christian Aistleitner ---- Companies' registry: 360296y in Linz Christian Aistleitner Kefermarkterstrasze 6a/3 Email: christian@quelltextlich.at 4293 Gutau, Austria Phone: +43 7946 / 20 5 81 Fax: +43 7946 / 20 5 81 Homepage: http://quelltextlich.at/
Analytics mailing list Analytics@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/analytics
Hi Aaron,
On Wed, Jul 30, 2014 at 08:37:34AM -0500, Aaron Halfaker wrote:
In general, scrubbing the data from the DB, but not the raw logs serves no useful purpose.
It serves the useful purpose of not unnecessarily wasting developer cycles.
The raw logs are not used for real research but only for debugging the database. At least that's what some people (you included) told me.
So if there is no extra-special need, I won't do house-keeping on the raw logs.
Have fun, Christian
Hi Christian,
On Thu, Jul 31, 2014 at 9:51 AM, Christian Aistleitner < christian@quelltextlich.at> wrote:
So if there is no extra-special need, I won't do house-keeping on the raw logs.
Just to clarify, we're not purging in this case just to do housekeeping on our schemas. We're purging the data in this case because that is what we agreed to with Legal. These schemas depended on setting a unique token for anonymous editors and then tracking them further (including whether they registered an account). Like CheckUser logs, this is something we needed to answer a question, but which we would rather not keep beyond 90 days, and which we do not normally collect on the site.
Our team has never asked for a purge before, and I would never do it as a matter of routine or if I didn't think it was a priority. I understand that time is the most precious thing we all have, perhaps especially so at work. ;-) I also let Kevin know months ago we would need some help doing this from Analytics.
Hi Steven,
On Thu, Jul 31, 2014 at 10:00:56AM -0700, Steven Walling wrote:
On Thu, Jul 31, 2014 at 9:51 AM, Christian Aistleitner < christian@quelltextlich.at> wrote:
So if there is no extra-special need, [...]
[...] because that is what we agreed to with Legal. [...]
“Agreement with legal” qualifies as perfectly fine “extra-special need” for me :-)
Let's spend time removing the Schemas from the logs then.
Since you said “probably” in the OP when it came to the data to remove: is it sufficient to remove the six schemas from https://meta.wikimedia.org/wiki/Research:Asking_anonymous_editors_to_registe... ? Or are there further schemas?
I guess that means removing the data for all days in the past? Or only for some period back?
From my point of view, we can take the discussion of details off-list.
I also let Kevin know months ago we would need some help doing this from Analytics.
Kevin, please do chime in on such threads then :-)
Have fun, Christian
Aha, I should have logged a bug a long time ago, but I was too much of a newbie to know. Here it is: https://bugzilla.wikimedia.org/show_bug.cgi?id=68978
Christian: before I prioritize it, can you scope out how much work would be required?
thanks, Kevin
On Thu, Jul 31, 2014 at 4:08 PM, Christian Aistleitner < christian@quelltextlich.at> wrote:
[...]
Hi Kevin,
[ moving list to bcc ]
On Thu, Jul 31, 2014 at 05:38:12PM -0700, Kevin Leduc wrote:
Christian: before I prioritize it, can you scope out how much work would be required?
To keep the on-list noise low, I replied on the corresponding bug https://bugzilla.wikimedia.org/show_bug.cgi?id=68931
Have fun, Christian