Re: [Analytics] A new format for the pageview dumps

25 Mar 2015


Hm.. interesting. A few random ideas:
I suppose extra columns wouldn't have to break existing scripts.
When querying for a total count, keeping the columns on one line would make things easier as we add columns, provided consumers' logic can do "sum any columns after title". If we then add another column (which, to some degree, means fragmenting traffic into more buckets, so you'd want to add up whatever columns there are at all times), their total count would stay accurate.
When querying for a specific type of traffic in the future (e.g. only traffic from *.applewatch.*.org ::troll::) one would have to be careful to account for the column not existing in older dumps.
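The "sum any columns after title" idea could look roughly like the sketch below. It assumes the tab-separated layout proposed later in this thread, with the project and title in the first two columns; the third count column in the second call is purely hypothetical.

```python
def total_views(line, fixed_columns=2):
    """Sum every count column that follows the project and title columns."""
    fields = line.rstrip("\n").split("\t")
    # Everything after the fixed columns is treated as a pageview count,
    # so the total stays correct when more count columns are added later.
    return sum(int(value) for value in fields[fixed_columns:])


# Proposed layout: project, title, desktop count, mobile count.
print(total_views("de.wikivoyage.org\tFlorence\t920\t32"))
# Hypothetical later dump with one extra count column.
print(total_views("de.wikivoyage.org\tFlorence\t920\t32\t5"))
```

A consumer written this way never needs to know the column names, only where the counts start.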
Perhaps:
* add a column for total count per title (adding up N+2 columns feels like something query tools generally would not support).
* give each type of traffic its own column (pv_desktop, pv_mobile, pv_zero, pv_app).
Or perhaps
* go back to separate lines for each traffic type (requiring users to be aware of the different types of traffic and their url permutation; or possibly we could use a normalised url and a dedicated column for traffic type).
This would mean users don't have to know the url or special identifier permutations and can simply skip that column if they want total counts. Very query-friendly.
canonical_project_hostname - traffic_source - encoded_title - pageviews_count
de.wikivoyage.org, desktop, "Florence", 920
de.wikivoyage.org, mobile, "Florence", 32
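As a rough illustration of why the one-line-per-traffic-type layout is query-friendly, here is a minimal sketch that reads the two sample rows as comma-separated text. The comma delimiter and quoting follow the sample only; the function name and the `source` parameter are made up for this example.

```python
import csv
from collections import defaultdict

SAMPLE = '''de.wikivoyage.org, desktop, "Florence", 920
de.wikivoyage.org, mobile, "Florence", 32
'''


def totals_by_title(lines, source=None):
    """Add up pageviews per (project, title), optionally for one traffic source."""
    totals = defaultdict(int)
    # skipinitialspace lets the reader cope with the space after each comma.
    for project, traffic_source, title, count in csv.reader(lines, skipinitialspace=True):
        if source is None or traffic_source == source:
            totals[(project, title)] += int(count)
    return dict(totals)


# Total count: the traffic_source column is simply ignored.
print(totals_by_title(SAMPLE.splitlines()))
# One specific traffic type: filter on the column instead of on URL permutations.
print(totals_by_title(SAMPLE.splitlines(), source="mobile"))
```

Adding a new traffic type then means new rows rather than a new column, so this reader would not need to change.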
/me goes back to lurking from the bushes,
— Timo
On 15 Mar 2015, at 15:44, Aaron Halfaker ahalfaker@wikimedia.org wrote:
...
It seems that this schema will need to be modified to include new types of view counts.  For example, it doesn't seem like views via the Wikipedia App are included (or maybe they are within "mobile").  If we wanted to have them separate, we'd need to add another column -- changing the data format/schema.  For any additional sources of views that we'd need to separate in the future, we'd need to add another column again.
So, in a way, this format is less 'normalized' than the previous, which would have put the counts from different sources on different lines.  This will require a data consumer who wants to process historic files to be able to handle files with differing numbers of columns and think of the incoming data as "full_project_url - title - desktop_counts - mobile_counts - ..." where "..." could contain <something> or <nothing> but must be handled regardless.
I think that this is undesirable -- but not *that* undesirable.
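One way a consumer might cope with files that have differing numbers of columns is to rely on the proposed header row and look columns up by name. The sketch below assumes the header uses the column names from the proposal; `app_pageviews` is a hypothetical future column, and returning `None` for a column the file does not have is a design choice, not part of any proposal here.

```python
import csv
import io

OLD_DUMP = (
    "full_project_url\tencoded_title\tdesktop_pageviews\tmobile_and_zero_pageviews\n"
    "de.wikivoyage.org\tFlorence\t920\t32\n"
)
# Hypothetical later dump that gained an extra column.
NEW_DUMP = (
    "full_project_url\tencoded_title\tdesktop_pageviews\tmobile_and_zero_pageviews\tapp_pageviews\n"
    "de.wikivoyage.org\tFlorence\t920\t32\t5\n"
)


def read_counts(handle, wanted):
    """Yield (project, title, counts) with None for columns the file lacks."""
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        counts = {}
        for name in wanted:
            value = row.get(name)
            # None means "not recorded in this dump", which differs from 0 views.
            counts[name] = None if value is None else int(value)
        yield row["full_project_url"], row["encoded_title"], counts


wanted = ["desktop_pageviews", "app_pageviews"]
print(list(read_counts(io.StringIO(OLD_DUMP), wanted)))
print(list(read_counts(io.StringIO(NEW_DUMP), wanted)))
```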
-Aaron
On Mar 14, 2015, at 14:00, Oliver Keyes wrote:
...
[..]
What I'd like to use as a new format is:
full_project_url - encoded_title - desktop_pageviews - mobile_and_zero_pageviews
This file would:

* Include a header row;
* Be formatted as a tab-separated, rather than space-separated, file;
* Exclude bytecounts;
* Include desktop and mobile pageview counts on the same line;
* Use the full project URL ("en.wikivoyage.org") instead of the pagecounts-specific notation ("en.v").
So, as a made-up example, instead of:
de.m.v Florence 32 9024
de.v Florence 920 7570
we'd end up with:
de.wikivoyage.org Florence 920 32
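The conversion in this made-up example could be sketched as follows. This is only an illustration: the suffix table contains just the single mapping used in the example above and is not a verified list of pagecounts abbreviations, and the function name is invented.

```python
from collections import defaultdict

# Assumption: taken from the made-up example in this thread only.
SUFFIX_TO_DOMAIN = {"v": "wikivoyage.org"}


def merge_old_lines(lines):
    """Fold space-separated old-style lines into tab-separated new-style rows."""
    merged = defaultdict(lambda: [0, 0])  # [desktop views, mobile views]
    for line in lines:
        code, title, views, _bytecount = line.split(" ")  # bytecount is dropped
        parts = code.split(".")
        language, suffix = parts[0], parts[-1]
        is_mobile = "m" in parts[1:-1]  # "de.m.v" is mobile, "de.v" is desktop
        host = f"{language}.{SUFFIX_TO_DOMAIN[suffix]}"
        merged[(host, title)][1 if is_mobile else 0] += int(views)
    return [
        "\t".join([host, title, str(desktop), str(mobile)])
        for (host, title), (desktop, mobile) in sorted(merged.items())
    ]


for row in merge_old_lines(["de.m.v Florence 32 9024", "de.v Florence 920 7570"]):
    print(row)
```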
