It seems that this schema will need to be modified to include new types of view counts.  For example, it doesn't seem like views via the Wikipedia App are included (or maybe they are within "mobile").  If we wanted to have them separate, we'd need to add another column -- changing the data format/schema.  For any additional sources of views that we'd need to separate in the future, we'd need to add another column again.

So, in a way, this format is less 'normalized' than the previous one, which would have put the counts from different sources on different lines.  This will require a data consumer who wants to process historic files to be able to handle files with differing numbers of columns and to think of the incoming data as "full_project_url - title - desktop_counts - mobile_counts - ..." where "..." could contain <something> or <nothing> but must be handled regardless.
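
To make that concrete, here is a minimal sketch in Python of a consumer that copes with a varying number of count columns by leaning on the proposed header row and unpivoting each line back into one record per source.  It is only an illustration under assumptions: the helper name `iter_long_rows`, the header column names, and the `app_pageviews` column are all hypothetical, and it assumes the first two columns always identify the project and the title.

```python
import csv
import io


def iter_long_rows(tsv_text):
    """Yield (project, title, source, count) tuples from a wide TSV dump.

    Assumes a header row, with the first two columns identifying the page
    and every further column holding the view count for one source.
    """
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t",
                        quoting=csv.QUOTE_NONE)
    header = next(reader)
    sources = header[2:]  # zero or more count columns, named by the header
    for row in reader:
        project, title, counts = row[0], row[1], row[2:]
        # zip() stops at the shorter sequence, so a short row simply
        # yields fewer counts instead of raising.
        for source, count in zip(sources, counts):
            yield project, title, source, int(count)


# Hypothetical files: an older one with two count columns and a newer one
# where an "app_pageviews" column has been added.
old_file = (
    "full_project_url\tencoded_title\tdesktop_pageviews\tmobile_and_zero_pageviews\n"
    "de.wikivoyage.org\tFlorence\t920\t32\n"
)
new_file = (
    "full_project_url\tencoded_title\tdesktop_pageviews\tmobile_and_zero_pageviews\tapp_pageviews\n"
    "de.wikivoyage.org\tFlorence\t920\t32\t5\n"
)

old_rows = list(iter_long_rows(old_file))
new_rows = list(iter_long_rows(new_file))
```

Both files go through the same code path; the only thing that changes is how many records come out per line.  Without a header row the consumer would have to know the column order for each file vintage some other way.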

I think that this is undesirable -- but not *that* undesirable.  

-Aaron





On Sat, Mar 14, 2015 at 5:58 PM, Oliver Keyes <okeyes@wikimedia.org> wrote:
Makes sense :). So, normalised titles at a minimum, and ideally both
normalised titles and pageID? (the disadvantage of just pageID is, of
course, having to look the darn thing up. The disadvantage of title is
that redirects that happen after-the-fact are a thing. Both would
solve for this).

On 14 March 2015 at 17:35, Roni Wiener <roni.wiener@keotic.com> wrote:
>
> Sounds great. I believe that normalization of the title will be very useful for future researchers and usages, as will adding the pageId.
> Currently it is not always straightforward to correlate the Wikipedia page with the unnormalized title.
>
>
>> On Mar 14, 2015, at 14:00, analytics-request@lists.wikimedia.org wrote:
>>
>> Today's Topics:
>>
>>   1. [Technical][Request for Comment] A new format for the
>>      pageview dumps (Oliver Keyes)
>>
>>
>> ----------------------------------------------------------------------
>>
>> Message: 1
>> Date: Fri, 13 Mar 2015 15:06:00 -0400
>> From: Oliver Keyes <okeyes@wikimedia.org>
>> To: "A mailing list for the Analytics Team at WMF and everybody who
>>    has an    interest in Wikipedia and analytics."
>>    <analytics@lists.wikimedia.org>,    Research into Wikimedia content and
>>    communities    <wiki-research-l@lists.wikimedia.org>
>> Subject: [Analytics] [Technical][Request for Comment] A new format for
>>    the    pageview dumps
>> Message-ID:
>>    <CAAUQgdCSVg8htCS4VFDZJaL2rUEjmk5ZEA+zXcp3o9OWn-udfQ@mail.gmail.com>
>> Content-Type: text/plain; charset=UTF-8
>>
>> So, we've got a new pageviews definition; it's nicely integrated and
>> spitting out TRUE/FALSE values on each row with the best of em. But
>> what does that mean for third-party researchers?
>>
>> Well...not much, at the moment, because the data isn't being released
>> somewhere. But one resource we do have that third-parties use a heck
>> of a lot, is the per-page pageviews dumps on dumps.wikimedia.org.
>>
>> Due to historical size constraints and decision-making (and by
>> historical I mean: last decade) these have a number of weirdnesses in
>> formatting terms; project identification is done using a notation
>> style not really used anywhere else, mobile/zero/desktop appear on
>> different lines, and the files are space-separated. I'd like to put
>> some volunteer time into spitting out dumps in an easier-to-work-with
>> format, using the new definition, to run in /parallel/ with the
>> existing logs.
>>
>> *The new format*
>> At the moment we have the format:
>>
>> project_notation - encoded_title - pageviews - bytes
>>
>> This puts zero and mobile requests to pageX in a different place to
>> desktop requests, requires some reconstruction of project_notation,
>> and contains (for some use cases) extraneous information - that being
>> the byte-count. The files are also headerless, unquoted and
>> space-separated, which saves space but is sometimes...I think the term
>> is "eeeeh-inducing".
>>
>> What I'd like to use as a new format is:
>>
>> full_project_url - encoded_title - desktop_pageviews - mobile_and_zero_pageviews
>>
>> This file would:
>>
>> 1. Include a header row;
>> 2. Be formatted as a tab-separated, rather than space-separated, file;
>> 3. Exclude bytecounts;
>> 4. Include desktop and mobile pageview counts on the same line;
>> 5. Use the full project URL ("en.wikivoyage.org") instead of the
>> pagecounts-specific notation ("en.v")
>>
>> So, as a made-up example, instead of:
>>
>> de.m.v Florence 32 9024
>> de.v Florence 920 7570
>>
>> we'd end up with:
>>
>> de.wikivoyage.org Florence 920 32
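
A rough sketch of the transformation in the made-up example above, in Python.  This is not the script mentioned in this thread; the function names, the `PROJECT_SUFFIXES` table (which holds only the single mapping given above), and the treatment of an "m" or "zero" segment as the mobile/zero marker are assumptions for illustration.

```python
from collections import defaultdict

# Only the mapping given in this thread ("en.v" -> "en.wikivoyage.org");
# a real table would need an entry for every project suffix, and this
# sketch does not handle notations that have no suffix at all.
PROJECT_SUFFIXES = {"v": "wikivoyage.org"}

# Assumption: these segments in the legacy notation mark non-desktop traffic.
MOBILE_MARKERS = {"m", "zero"}


def merge_legacy_lines(lines):
    """Combine legacy space-separated lines into one entry per page.

    Returns {(full_project_url, title): [desktop_views, mobile_and_zero_views]}.
    """
    merged = defaultdict(lambda: [0, 0])
    for line in lines:
        # The legacy byte count is parsed but deliberately dropped.
        notation, title, views, _bytes = line.split(" ")
        parts = notation.split(".")
        language, suffix = parts[0], parts[-1]
        is_mobile = any(p in MOBILE_MARKERS for p in parts[1:-1])
        url = f"{language}.{PROJECT_SUFFIXES[suffix]}"
        merged[(url, title)][1 if is_mobile else 0] += int(views)
    return merged


def to_tsv(merged):
    """Render merged counts as tab-separated text with a header row."""
    header = ("full_project_url\tencoded_title\t"
              "desktop_pageviews\tmobile_and_zero_pageviews")
    rows = [f"{url}\t{title}\t{desktop}\t{mobile}"
            for (url, title), (desktop, mobile) in sorted(merged.items())]
    return "\n".join([header] + rows)


legacy = ["de.m.v Florence 32 9024", "de.v Florence 920 7570"]
output = to_tsv(merge_legacy_lines(legacy))
```

Here `output` holds the header row followed by the single merged, tab-separated line for Florence on de.wikivoyage.org.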
>>
>> In the future we could also work to /normalise/ the title - replacing
>> it with the page title that refers to the actual pageID. This won't
>> impact legacy files, and is currently blocked on the Apps team, but
>> should be viable as soon as that blocker goes away.
>>
>> I've written a script capable of parsing and reformatting the legacy
>> files, so we should be able to backfill in this new format too, if
>> that's wanted (see below).
>>
>> *The size constraints*
>>
>> There really aren't any. Like I said, the historical rationale for a
>> lot of these decisions seems to have been keeping the files small. But
>> by putting requests to the same title from different site versions on
>> the same line, and dropping byte-count, we save enough space that the
>> resulting files are approximately the same size as the old ones - or
>> in many cases, actually smaller.
>>
>> *What I'm asking for*
>>
>> Feedback! What do people think of the new format? What would they like
>> to see that they don't? What don't they need, here? How useful would
>> normalisation be? How useful would backfilling be?
>>
>> *What I'm not asking for*
>> WMF time! Like I said, this is a spare-time project; I've also got
>> volunteers for Code Review and checking, too (Yuvi and Otto).
>>
>> The replacement of the old files! Too many people depend on that
>> format and that definition, and I don't want to make them sad.
>>
>> Thoughts?
>>
>> --
>> Oliver Keyes
>> Research Analyst
>> Wikimedia Foundation
>>
>>
>>
>> ------------------------------
>>
>> _______________________________________________
>> Analytics mailing list
>> Analytics@lists.wikimedia.org
>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>
>>
>> End of Analytics Digest, Vol 37, Issue 33
>> *****************************************
>



--
Oliver Keyes
Research Analyst
Wikimedia Foundation
