There is a lot of cruft in the data files: requests for non-existent pages, e.g. due to typos.

I saved about 1/3 on the monthly aggregates by applying a threshold of 5+ views per month.

 

In a database this could be done by a monthly cleanup job, which runs soon after the month is complete.

This job calculates monthly totals and deletes articles for which this total is below a threshold.
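
A minimal sketch of such a job in MySQL, assuming hypothetical table and column names (fact, page_id, pageview_date, pageview_count) and September 2013 as the month just completed:

    -- Delete all rows for pages whose monthly total is below the
    -- threshold. The derived table works around MySQL's restriction
    -- on deleting from a table that the query also selects from.
    DELETE FROM fact
    WHERE pageview_date BETWEEN '2013-09-01' AND '2013-09-30'
      AND page_id IN (
        SELECT page_id FROM (
          SELECT page_id
          FROM fact
          WHERE pageview_date BETWEEN '2013-09-01' AND '2013-09-30'
          GROUP BY page_id
          HAVING SUM(pageview_count) < 5  -- 5+ views/month survive
        ) AS low_traffic
      );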

 

Erik

 

From: analytics-bounces@lists.wikimedia.org [mailto:analytics-bounces@lists.wikimedia.org] On Behalf Of Magnus Manske
Sent: Wednesday, October 02, 2013 1:11 AM
To: A mailing list for the Analytics Team at WMF and everybody who has an interest in Wikipedia and analytics.
Subject: Re: [Analytics] Back of the envelope data size for "Queryable public interface for pageview data" [was: Re: Queryable public interface for pageview data]

 

300GB per year should be trivial. It might increase slightly if we decide to use a relational database for everything; daily data could live in special flat files, and monthly could be in SQL.

In terms of hacks, we can use 2 bytes per date if we count days from Jan 1 2001 (day 1 A.W. = ante Wikipedia ;-), which would last for ~180 years. Down to 10 bytes per row already (4-byte page id + 2-byte date + 4-byte view count, with "bytes served" dropped)!
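
A quick sketch of that encoding in MySQL, using standard date functions (SMALLINT UNSIGNED tops out at 65,535 days, i.e. ~179 years):

    -- Encode: days since 2001-01-01 fit into a 2-byte SMALLINT UNSIGNED.
    SELECT DATEDIFF('2013-10-02', '2001-01-01') AS day_aw;         -- 4657
    -- Decode: back to a regular DATE.
    SELECT DATE_ADD('2001-01-01', INTERVAL 4657 DAY) AS the_date;  -- 2013-10-02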

And how about two tables: one with 4-byte view counts for "much-viewed" pages (few), and one with 2-byte counts (<65,536 views/day) for the majority? That probably saves another ~1.9 bytes/row, at the cost of more complex low-level queries.
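
A rough sketch of that split, with hypothetical table names, plus a UNION ALL view so most queries don't need to care which table a row lives in:

    CREATE TABLE fact_small (              -- the vast majority of pages
      page_id INT UNSIGNED      NOT NULL,
      day_aw  SMALLINT UNSIGNED NOT NULL,  -- days since 2001-01-01, see above
      views   SMALLINT UNSIGNED NOT NULL   -- < 65,536 views/day
    );

    CREATE TABLE fact_large (              -- the few "much-viewed" pages
      page_id INT UNSIGNED      NOT NULL,
      day_aw  SMALLINT UNSIGNED NOT NULL,
      views   INT UNSIGNED      NOT NULL   -- 4-byte counts
    );

    CREATE VIEW fact_all AS
      SELECT page_id, day_aw, views FROM fact_small
      UNION ALL
      SELECT page_id, day_aw, views FROM fact_large;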

 

On Tue, Oct 1, 2013 at 11:52 PM, Christian Aistleitner <christian@quelltextlich.at> wrote:

Hi,

today there were some guesses about how much data we are actually
talking about for the “queryable public interface for pageview data”.
As no one followed up on that, I did some quick “back of the envelope”
computations.

The pagecounts files now come with ~7-9M data points/hour.

As we talked about computing "per day" data, multiple hourly readings
for the same page collapse into a single row; for the two days that I
computed, I arrived at ~70M data points/day.

So each day we can expect ~70M new data points = ~70M new rows.
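
For illustration, that per-day collapse as a single aggregation over
a hypothetical staging table of hourly readings (names are made up):

    -- Hourly readings for the same page on the same day collapse
    -- into one per-day row.
    SELECT page_id,
           DATE(read_hour) AS pageview_date,
           SUM(views)      AS pageview_count,
           SUM(bytes)      AS bytes_served
    FROM pagecounts_hourly
    GROUP BY page_id, DATE(read_hour);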

Now to the per-row cost:

On Mon, Sep 30, 2013 at 01:57:58PM -0700, Diederik van Liere wrote:
> 2) Make simple datawarehouse schema for mysql db (based on the current
> webstatscollector datafiles)
>
> Page Table
> ==========
> page_id
> page title

Let's ignore that table for now.

> Fact Table
> ========
> fact_id

Let's ignore "fact_id" as it's more than the bare pagecount files
provide.

> page_id                    (FK -> Page Table)

Since we see ~70M different rows/day, MEDIUMINT (max ~16.8M unsigned)
is too small. Even with an INT foreign key, we may run out of numbers
eventually. But let's ignore that for now and settle on INT = 4 bytes.

> pageview date

DATE is 3 bytes

> pageview count

Zipping through some random pagecount files, I saw values up to
~17M/hour. 17M is too big for MEDIUMINT, so we have to take at least
INT = 4 bytes. That also suffices for the per-day sums.

> bytes served

Zipping through some random pagecount files, I saw values up to
~54G/hour. That's too big for INT. So let's settle on BIGINT = 8
bytes. That also suffices for the per-day sums.

Summing up, we have 4+3+4+8 = 19 bytes per data point.
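
For concreteness, the minimal row discussed above as a MySQL table (a
sketch; names follow Diederik's proposal):

    CREATE TABLE fact (
      page_id        INT UNSIGNED    NOT NULL, -- 4 bytes, FK -> page table
      pageview_date  DATE            NOT NULL, -- 3 bytes
      pageview_count INT UNSIGNED    NOT NULL, -- 4 bytes (hourly peaks ~17M)
      bytes_served   BIGINT UNSIGNED NOT NULL  -- 8 bytes (hourly peaks ~54G)
    );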

19 bytes * 70M rows/day * 365 days ≈ 485 GB per year.

And these 19 bytes are really minimal: the ids will overrun
eventually, and there is no row id. The 70M rows/day estimate is
very, very conservative. And the total size is without the page
table, without indices, and without all the other stuff that comes in.

Have fun,
Christian

P.S.: There was some discussion about whether or not we should drop
"bytes served", as "no one uses it". I am glad we did not second-guess
the requirements and kept the field in, as that bought us a nice
argument for free if we decide to drop/discuss it: "bytes served" is
worth ~200GB per year when modelled directly (8 bytes * 70M rows/day
* 365 days).

While such a decrease would help, it would not solve the problem, as
~285 GB of data per year would remain.

P.P.S.: Hope is not lost :-) There are of course thousands of ways to
get the size down.



--
---- quelltextlich e.U. ---- \\ ---- Christian Aistleitner ----
                           Companies' registry: 360296y in Linz
Christian Aistleitner
Gruendbergstrasze 65a        Email:  christian@quelltextlich.at
4040 Linz, Austria           Phone:          +43 732 / 26 95 63
                             Fax:            +43 732 / 26 95 63
                             Homepage: http://quelltextlich.at/
---------------------------------------------------------------




