Hm, I don't think we will have much trouble with
the size of the input.
Well, my post was also about how to store hourly data in a concise manner
(a sparse array, really), so we could serve hourly precision without too
much overhead.
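Erik's actual file format isn't shown in this thread; purely as an illustration of the sparse-array idea, hourly counts could be stored by keeping only the hours that have any views (function and variable names here are made up):

```python
# Hypothetical sketch of a sparse hourly layout: keep only the hours
# that actually have views, instead of 24 slots per day.
def to_sparse(hourly_counts):
    """hourly_counts: list of 24 ints for one day -> dict of hour -> count."""
    return {hour: count for hour, count in enumerate(hourly_counts) if count}

def daily_total(sparse_day):
    """Daily total recovered from the sparse representation."""
    return sum(sparse_day.values())

day = [0] * 24
day[9], day[14] = 120, 85      # views in only two hours of the day
sparse = to_sparse(day)        # {9: 120, 14: 85}
```

For low-traffic pages most hours are zero, so this stores far fewer entries than a dense 24-slot row while still serving hourly precision.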
Erik
From: analytics-bounces@lists.wikimedia.org
[mailto:analytics-bounces@lists.wikimedia.org] On Behalf Of Dan Andreescu
Sent: Wednesday, October 02, 2013 6:19 PM
To: A mailing list for the Analytics Team at WMF and everybody who has an
interest in Wikipedia and analytics.
Subject: Re: [Analytics] Back of the envelope data size for "Queryable
public interface for pageview data" [was: Re: Queryable public interface for
pageview data]
Here is a shameless plug for existing data files in a highly condensed
format, which could serve as input for whatever database we choose.
Hm, I don't think we will have much trouble with the size of the input.
We're currently thinking of processing the hourly data through Hadoop, and
that shouldn't even blink at a few TB of data per day. What we'd like to
come to consensus on is the most useful output format. So far, I'm hearing
that monthly aggregates by page in a MySQL database are the bare minimum we
should release on day 1. We can then iterate and add any useful dimensions
to this data (like category information) or increase the resolution in
parallel tables. If the data becomes too large for MySQL, we can look at
other databases.
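The "day 1" output Dan describes is a monthly rollup per page. As a sketch only (the record shape and names are assumptions, not a settled schema), the Hadoop side of that aggregation amounts to:

```python
from collections import defaultdict

# Hypothetical sketch: roll hourly (page, timestamp, count) records up to
# monthly totals per page -- the shape of the proposed day-1 MySQL table.
def monthly_aggregate(records):
    """records: iterable of (page, 'YYYY-MM-DD-HH', count) tuples.
    Returns {(page, 'YYYY-MM'): total_views}."""
    totals = defaultdict(int)
    for page, timestamp, count in records:
        month = timestamp[:7]          # keep only the 'YYYY-MM' prefix
        totals[(page, month)] += count
    return dict(totals)

rows = [
    ("Main_Page", "2013-10-02-18", 500),
    ("Main_Page", "2013-10-03-09", 300),
    ("Main_Page", "2013-11-01-00", 42),
]
monthly_aggregate(rows)
# {("Main_Page", "2013-10"): 800, ("Main_Page", "2013-11"): 42}
```

Extra dimensions (such as category) or finer resolutions would simply widen the grouping key, which is why they can be added iteratively in parallel tables.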
One goody of these files: they detect missing input files and correct the
monthly counts to compensate.
This is something we definitely want to port to Hadoop / use in some way.
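Erik's correction logic isn't spelled out in the thread; one plausible scheme (an assumption, not his actual code) scales the observed total by the fraction of hourly files that were present:

```python
# Hypothetical sketch of correcting a monthly count for missing hourly
# input files: scale the observed total by the fraction of hours covered.
def corrected_monthly_count(observed_total, hours_present, hours_in_month):
    if hours_present == 0:
        return 0                      # no data at all; nothing to extrapolate
    return round(observed_total * hours_in_month / hours_present)

# October has 31 * 24 = 744 hourly files; suppose 6 of them are missing.
corrected_monthly_count(100_000, 744 - 6, 744)
```

The same idea ports naturally to Hadoop: count the distinct hourly inputs seen per month alongside the view totals, then apply the scaling factor in a final step.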