> Sure, it doesn't have space problems, but the problem remains that with a table this large, it's impossible to query and get results in our lifetime.
I see, makes sense.

> I think in this case moving all of the data to Hadoop and blacklisting it from the MySQL inserter seems like the right thing to do.
I agree. We should implement partial auto-purging in Hadoop though. In the Echo schema some fields should still be purged.
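
To make "partial auto-purging" a bit more concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the retention window, the field names, and the `partially_purge` helper are hypothetical, and the real values would come from the purging strategy agreed on the schema's talk page. The idea is only that old events keep their non-sensitive fields while the fields marked for purging are nulled out.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical values, for illustration only: the real window and field list
# would come from the purging strategy agreed on the schema's talk page.
RETENTION = timedelta(days=90)
FIELDS_TO_PURGE = {"userAgent", "clientIp"}


def partially_purge(event, now):
    """Return the event, with purgeable fields nulled if it is older than RETENTION."""
    if now - event["timestamp"] <= RETENTION:
        return event  # still inside the retention window, keep as-is
    # Older than the window: keep every key, but null the purgeable fields.
    return {
        key: (None if key in FIELDS_TO_PURGE else value)
        for key, value in event.items()
    }


now = datetime(2015, 12, 16, tzinfo=timezone.utc)
old_event = {
    "timestamp": now - timedelta(days=120),
    "userAgent": "example-agent",
    "clientIp": "192.0.2.1",
    "schema": "Echo",
}
purged = partially_purge(old_event, now)
```

In Hadoop this would more likely run as a periodic job over partitions older than the window, but the per-record logic would be the same.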

On Wed, Dec 16, 2015 at 3:07 PM, Dan Andreescu <dandreescu@wikimedia.org> wrote:
Just spoke with Jaime Crespo and he confirmed that:
  • m4-master (master EL database) only holds events for the last 45 days to avoid space problems. That's for all tables including Echo.

  • analytics-storage is the replica that keeps the historical data and is meant to apply the specific purging strategy agreed in the schema's talk page. This database does not have space problems (yet).

Sure, it doesn't have space problems, but the problem remains that with a table this large, it's impossible to query and get results in our lifetime. So we need to come up with better solutions for cases where we have these huge volumes of valuable data. I think in this case moving all of the data to Hadoop and blacklisting it from the MySQL inserter seems like the right thing to do. The only reason for data to exist in MySQL should be that we're querying it on a frequent, periodic basis and taking actions based on the results of those queries. Otherwise it's a waste of resources and we should allocate that disk space to something else.

_______________________________________________
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics

--
Marcel Ruiz Forns
Analytics Developer
Wikimedia Foundation