Here is a shameless plug for the existing data files in a highly condensed format, which could serve as input for whatever database we choose.
Hm, I don't think we will have much trouble with the size of the input.
We're currently thinking of processing the hourly data through Hadoop, and
that shouldn't even blink at a few TB of data per day. What we'd like to
come to consensus on is the most useful output format. So far, I'm hearing
that monthly aggregates by page in a MySQL database is the bare minimum we
should release on day 1. We can then iterate and add any useful dimensions
to this data (like category information) or increase the resolution in
parallel tables. If the data becomes too large for MySQL, we can look at
other databases.
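To make the day-1 output concrete, here is a minimal sketch of the monthly roll-up step. All names here are hypothetical (the original thread does not specify a schema): it assumes hourly rows of (timestamp, page, count) and produces per-page monthly totals keyed on (month, page), which would map directly onto a MySQL table with that composite key.

```python
from collections import defaultdict

def monthly_aggregate(hourly_rows):
    """Roll hourly (timestamp, page, count) rows up into monthly totals.

    hourly_rows: iterable of ('YYYY-MM-DD-HH', page, count) tuples.
    Returns {(month, page): total}, i.e. one row per page per month,
    ready to bulk-load into a table keyed on (month, page).
    """
    totals = defaultdict(int)
    for ts, page, count in hourly_rows:
        month = ts[:7]  # 'YYYY-MM' prefix of the hourly timestamp
        totals[(month, page)] += count
    return dict(totals)
```

In a Hadoop job this would be the reduce side, with (month, page) as the key; the sketch above just shows the aggregation logic in one place.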
One goody of these files: they detect missing input files and correct the monthly counts to make up for the gaps.
This is something we definitely want to port to Hadoop or otherwise reuse.
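The original post does not say how the correction works, but one plausible reading is a simple extrapolation: scale the observed total by the fraction of hourly files that actually arrived. A sketch under that assumption (the function name and signature are mine):

```python
def corrected_monthly_count(observed_total, hours_present, hours_in_month):
    """Extrapolate a monthly count when some hourly input files are missing.

    Assumption (not stated in the original thread): the correction scales
    the observed total by expected/present hours, i.e. it treats missing
    hours as if they had the average traffic of the observed hours.
    """
    if hours_present == 0:
        return 0  # no data at all; nothing sensible to extrapolate
    return round(observed_total * hours_in_month / hours_present)
```

For example, with 710 of 720 hourly files present, a raw total of 71,000 would be corrected upward to 72,000. Whether the real scripts do exactly this or something more refined (e.g. per-weekday averages) would need checking against their source.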