On Tue, Dec 15, 2015 at 8:22 AM, Nuria Ruiz nuria@wikimedia.org wrote:
Roan:
The data for the Echo schema (https://meta.wikimedia.org/wiki/Schema:Echo) is quite large and we are not sure it is even used.
Can you confirm either way? If it is no longer used we will stop collecting it.
Thanks,
Nuria
On Tue, Dec 15, 2015 at 9:22 AM, Jonathan Morgan jmorgan@wikimedia.org wrote:
Hi Nuria,
FWIW: Although I'm not using this right now, I could see it being useful for understanding the impact of new notification updates that are coming down the pike.[1][2]
What are the costs involved in keeping this schema up?
Best, J
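1. https://meta.wikimedia.org/wiki/Research:Cross-wiki_notifications_user_research
2. https://phabricator.wikimedia.org/T116741
-- Jonathan T. Morgan, Senior Design Researcher, Wikimedia Foundation, User:Jmorgan (WMF) https://meta.wikimedia.org/wiki/User:Jmorgan_(WMF)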
Analytics mailing list Analytics@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/analytics
What are the costs involved in keeping this schema up?
Well, database space is being used in a not-so-smart manner (huge tables that basically become unqueryable). This table is now 9G and it doesn't look like anyone is looking at this data.
On Tue, Dec 15, 2015 at 9:35 AM, Andrew Otto aotto@wikimedia.org wrote:
We could blacklist this schema from the mysql database, and still keep producing it. It would be available in Hadoop either way.
On Tue, Dec 15, 2015 at 9:50 AM, Nuria Ruiz nuria@wikimedia.org wrote:
We could blacklist this schema from the mysql database, and still keep producing it. It would be available in Hadoop either way.
Right, but I would also like to drop the table if it is not being used. If the data is not going to be looked at soonish, there is no point in storing it, as it will likely be deleted before anyone looks at it.
Thanks,
Nuria
On Tue, Dec 15, 2015 at 11:16 AM, Jonathan Morgan jmorgan@wikimedia.org wrote:
Hi Nuria!
Speaking for *my own particular scenario*, that solution sounds like it will be fine, since I don't plan on immediately performing research with these data.
But it's obviously still the Collab team's call here--they likely have needs I know nothing about. Cc'ing Joe Matazzoni in case he's not following this already...
J
On Tue, Mar 1, 2016 at 11:58 PM, Roan Kattouw rkattouw@wikimedia.org wrote:
[Reviving old thread]
I was looking at our EventLogging data today, and discovered that Schema:Edit contains no useful information that isn't already in the database apart from which button people use to thank each other, and if we really care about that we can measure it separately without producing nine gigs of unused data.
Feel free to delete the data associated with Schema:Echo (but not Schema:EchoInteraction! We do use that one) with extreme prejudice. I've also written a config patch to stop us from producing these events ( https://gerrit.wikimedia.org/r/#/c/274345/ ) which I will deploy in the SWAT on Thursday.
I also found that a long-standing issue with duplicate events in Schema:EchoInteraction wasn't fixed yet, so I wrote a patch for that too: https://gerrit.wikimedia.org/r/274342
*Schema:Edit contains no useful information that isn't already in the database apart from which button people use to thank each other,*
I assume you mean Schema:Echo? :)
On Wed, Mar 2, 2016 at 9:34 AM, Neil P. Quinn nquinn@wikimedia.org wrote:
*Schema:Edit contains no useful information that isn't already in the database apart from which button people use to thank each other,*
I assume you mean Schema:Echo? :)
YES. Yes. ECHO, not Edit.
I saw myself make this mistake in the other paragraph and fixed it, but apparently I missed that one. I should write fewer emails at midnight.
Dear analytics people, please don't delete Schema:Edit or Neil will be upset :) . But kill Schema:Echo with fire.
OK, the Schema_talk page is updated and a task is filed with Jaime cc-ed (he's the one who has the permissions to do this): https://meta.wikimedia.org/wiki/Schema_talk:Echo
I was about to say I'm cc-ing the analytics list when I see that, apparently, Roan's email address is analytics@lists.wikimedia.org. Huh? Lol :)
On Wed, Mar 2, 2016 at 1:11 PM, Dan Andreescu dandreescu@wikimedia.org wrote:
K, I'll delete Schema:Edit
:) just kidding
OK, so we will just set the policy for Schema:Echo to purge after 90 days, so the data will delete itself and give y'all time to do any last queries you might want.
From: Roan Kattouw
Sent: Wednesday, March 2, 2016 12:41
To: Neil P. Quinn
Reply To: A mailing list for the Analytics Team at WMF and everybody who has an interest in Wikipedia and analytics.
Cc: A mailing list for the Analytics Team at WMF and everybody who has an interest in Wikipedia and analytics.
Subject: Re: [Analytics] Echo schema eventlogging
On Tue, Dec 15, 2015 at 11:27 AM, Roan Kattouw rkattouw@wikimedia.org wrote:
If the data is going to be retained but would just become harder to query (i.e. still in Hadoop but not in mysql), maybe we could nuke data that's more than a year old (or 6 months old or something) from mysql?
maybe we could nuke data that's more than a year old (or 6 months old or something) from mysql?
With eventlogging data we "normally" drop data that is older than 90 days. Will this work?
Thanks for the prompt response.
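For illustration, the 90-day purge mentioned above could be sketched roughly as follows. This is only a minimal sketch under stated assumptions: the table name `echo_events`, its columns, and the 14-digit `timestamp` string format are hypothetical stand-ins, and an in-memory SQLite database is used here instead of the real MySQL hosts so the example can run anywhere.

```python
import sqlite3
from datetime import datetime, timedelta

# Assumption: timestamps are stored as fixed-width 14-digit strings (YYYYMMDDHHMMSS),
# so plain string comparison orders them chronologically.
TS_FORMAT = "%Y%m%d%H%M%S"


def purge_cutoff(now, retention_days=90):
    """Return the oldest timestamp string that should still be kept."""
    return (now - timedelta(days=retention_days)).strftime(TS_FORMAT)


def purge_old_events(conn, table, now, retention_days=90):
    """Delete rows older than the retention window and return how many were removed."""
    cutoff = purge_cutoff(now, retention_days)
    # A table name cannot be bound as a query parameter, so it must come from
    # trusted configuration, never from user input.
    cur = conn.execute(f"DELETE FROM {table} WHERE timestamp < ?", (cutoff,))
    conn.commit()
    return cur.rowcount


def demo():
    """Build a tiny hypothetical table, purge it, and report what is left."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE echo_events (uuid TEXT PRIMARY KEY, timestamp TEXT)")
    conn.executemany(
        "INSERT INTO echo_events VALUES (?, ?)",
        [("a", "20150601000000"), ("b", "20151001000000"), ("c", "20151214000000")],
    )
    removed = purge_old_events(conn, "echo_events", datetime(2015, 12, 15, 11, 27))
    kept = [row[0] for row in conn.execute("SELECT uuid FROM echo_events ORDER BY uuid")]
    return removed, kept
```

Running the purge on a schedule, rather than once, is what would make the data "delete itself"; the retention length is a single parameter so a different policy can be tried without touching the query.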
On Tue, Dec 15, 2015 at 1:34 PM, Madhumitha Viswanathan mviswanathan@wikimedia.org wrote:
I want to mention that data in Hadoop is only available from Aug 27th 2015. Older data is only available in mysql.
On Wed, Dec 16, 2015 at 2:14 AM, Aaron Halfaker ahalfaker@wikimedia.org wrote:
No! Please do not nuke old data. +1 to J-Mo. This will probably be useful for long-term studies of notifications. If I had the time, I'd pick it up right now based on this reminder!
I'm happy with having historical data preserved (please make sure that it is) and the MySQL table dropped up to a recent point. It will be important that we can come back to this later and either restore the data or query it in its entirety from Hadoop.
-Aaron
Just spoke with Jaime Crespo and he confirmed that:
- m4-master (master EL database) only holds events for the last 45 days to avoid space problems. That's for all tables including Echo.
- analytics-storage is the replica that keeps the historical data and is meant to apply the specific purging strategy agreed in the schema's talk page. This database does not have space problems (yet).
On Wed, Dec 16, 2015 at 3:07 PM, Dan Andreescu dandreescu@wikimedia.org wrote:
Sure, it doesn't have space problems, but the problem remains that with a table this large, it's impossible to query and get results in our lifetime. So we need to come up with some better solutions for cases where we have these huge volumes of valuable data. I think in this case moving all of the data to Hadoop and blacklisting it from the mysql inserter seems like the right thing to do. The only reason for data to exist in mysql should be if we're querying the data on a frequent, periodic basis and taking actions based on the results of those queries. Otherwise it's a waste of resources and we should allocate that disk space to something else.
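As a rough illustration of what "blacklisting it from the mysql inserter" could mean in practice, the sketch below routes each event to its storage sinks. It is not the actual EventLogging code: the `schema` key on the event, the sink names, and the `MYSQL_BLACKLIST` setting are assumptions made for the example.

```python
from typing import Dict, FrozenSet, Iterable, List

# Hypothetical configuration: schemas that should no longer be written to MySQL.
MYSQL_BLACKLIST: FrozenSet[str] = frozenset({"Echo"})


def sinks_for(event: Dict[str, object],
              mysql_blacklist: FrozenSet[str] = MYSQL_BLACKLIST) -> List[str]:
    """Decide where one event goes: always Hadoop, MySQL only if not blacklisted."""
    sinks = ["hadoop"]
    if event.get("schema") not in mysql_blacklist:
        sinks.append("mysql")
    return sinks


def route(events: Iterable[Dict[str, object]]) -> Dict[str, List[Dict[str, object]]]:
    """Group a batch of events by the sink they should be written to."""
    routed: Dict[str, List[Dict[str, object]]] = {"hadoop": [], "mysql": []}
    for event in events:
        for sink in sinks_for(event):
            routed[sink].append(event)
    return routed
```

With this shape, the events keep being produced and keep landing in Hadoop; only the MySQL write is skipped, so removing a schema from the blacklist later would resume MySQL inserts without any change on the producing side.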
On Wed, Dec 16, 2015 at 6:55 AM, Marcel Ruiz Forns mforns@wikimedia.org wrote:
Sure, it doesn't have space problems, but the problem remains that with a table this large, it's impossible to query and get results in our lifetime.
I see, makes sense.
I think in this case moving all of the data to Hadoop and blacklisting it from the mysql inserter seems like the right thing to do.
I agree. We should implement partial auto-purging in Hadoop though. In the Echo schema some fields should still be purged.
-- Marcel Ruiz Forns, Analytics Developer, Wikimedia Foundation
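One possible reading of "partial auto-purging" is that old events are kept but selected fields are blanked. The sketch below shows that idea only; the field names in `FIELDS_TO_PURGE`, the 90-day window, and the 14-digit `timestamp` format are hypothetical, and the real list of fields would be whatever is agreed on the schema's talk page.

```python
from datetime import datetime, timedelta
from typing import Any, Dict, Iterable

# Hypothetical field names, chosen only for the example.
FIELDS_TO_PURGE = ("clientIp", "userAgent")


def partially_purge(event: Dict[str, Any], now: datetime, retention_days: int = 90,
                    fields: Iterable[str] = FIELDS_TO_PURGE) -> Dict[str, Any]:
    """Return a copy of the event with selected fields blanked once it is too old."""
    event_time = datetime.strptime(event["timestamp"], "%Y%m%d%H%M%S")
    purged = dict(event)  # never mutate the caller's event
    if now - event_time <= timedelta(days=retention_days):
        return purged  # still inside the retention window, keep everything
    for name in fields:
        if name in purged:
            purged[name] = None
    return purged
```

Blanking fields instead of deleting whole rows keeps the event counts and non-sensitive fields available for long-term studies, which is the concern raised earlier in the thread.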
I think in this case moving all of the data to Hadoop and blacklisting it from the mysql inserter seems like the right thing to do.
I agree. We should implement partial auto-purging in Hadoop though. In the Echo schema some fields should still be purged.
Right, being able to move all this data to Hadoop is contingent on having a purging strategy. Filed a ticket to do this:
https://phabricator.wikimedia.org/T121657