New subject: Re: A new format for the pageview dumps

14 Mar 2015

Sounds great. I believe that normalizing the title will be very useful for future researchers and use cases, as will adding the pageId.
Currently it is not always straightforward to match an unnormalized title to the corresponding Wikipedia page.
...
On Mar 14, 2015, at 14:00, analytics-request@lists.wikimedia.org wrote:
Send Analytics mailing list submissions to
   analytics@lists.wikimedia.org
To subscribe or unsubscribe via the World Wide Web, visit
   https://lists.wikimedia.org/mailman/listinfo/analytics
or, via email, send a message with subject or body 'help' to
   analytics-request@lists.wikimedia.org
You can reach the person managing the list at
   analytics-owner@lists.wikimedia.org
When replying, please edit your Subject line so it is more specific
than "Re: Contents of Analytics digest..."
Today's Topics:

[Technical][Request for Comment] A new format for the
pageview dumps (Oliver Keyes)

Message: 1
Date: Fri, 13 Mar 2015 15:06:00 -0400
From: Oliver Keyes okeyes@wikimedia.org
To: "A mailing list for the Analytics Team at WMF and everybody who
   has an interest in Wikipedia and analytics."
   analytics@lists.wikimedia.org, Research into Wikimedia content and
   communities wiki-research-l@lists.wikimedia.org
Subject: [Analytics] [Technical][Request for Comment] A new format for
   the pageview dumps
Message-ID:
   CAAUQgdCSVg8htCS4VFDZJaL2rUEjmk5ZEA+zXcp3o9OWn-udfQ@mail.gmail.com
Content-Type: text/plain; charset=UTF-8
So, we've got a new pageviews definition; it's nicely integrated and
spitting out TRUE/FALSE values on each row with the best of 'em. But
what does that mean for third-party researchers?
Well...not much, at the moment, because the data isn't being released
anywhere. But one resource we do have, and that third parties use a heck
of a lot, is the per-page pageviews dumps on dumps.wikimedia.org.
Due to historical size constraints and decision-making (and by
historical I mean: last decade) these have a number of weirdnesses in
formatting terms; project identification is done using a notation
style not really used anywhere else, mobile/zero/desktop appear on
different lines, and the files are space-separated. I'd like to put
some volunteer time into spitting out dumps in an easier-to-work-with
format, using the new definition, to run in /parallel/ with the
existing logs.
*The new format*
At the moment we have the format:
project_notation - encoded_title - pageviews - bytes
This puts zero and mobile requests to pageX in a different place to
desktop requests, requires some reconstruction of project_notation,
and contains (for some use cases) extraneous information - that being
the byte-count. The files are also headerless, unquoted and
space-separated, which saves space but is sometimes...I think the term
is "eeeeh-inducing".
What I'd like to use as a new format is:
full_project_url - encoded_title - desktop_pageviews - mobile_and_zero_pageviews
This file would:

1. Include a header row;
2. Be formatted as a tab-separated, rather than space-separated, file;
3. Exclude bytecounts;
4. Include desktop and mobile pageview counts on the same line;
5. Use the full project URL ("en.wikivoyage.org") instead of the
   pagecounts-specific notation ("en.v").
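
As a rough illustration of the file properties listed above, here is a minimal Python sketch that writes a header row followed by tab-separated rows. The function name write_dump and the quoting behaviour (Python's csv default, which quotes only fields that need it) are assumptions for illustration; the proposal does not specify either.

```python
import csv
import io

# Column names taken from the proposed format in this message.
HEADER = [
    "full_project_url",
    "encoded_title",
    "desktop_pageviews",
    "mobile_and_zero_pageviews",
]


def write_dump(rows, out):
    """Write (project_url, title, desktop, mobile_and_zero) tuples as TSV.

    Hypothetical helper: the quoting policy is csv's default, which is
    an assumption and not something the proposal defines.
    """
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(HEADER)   # header row first
    writer.writerows(rows)    # one line per (project, title) pair


buf = io.StringIO()
write_dump([("de.wikivoyage.org", "Florence", 920, 32)], buf)
print(buf.getvalue(), end="")
```

A consumer could then read the file back with csv.DictReader(f, delimiter="\t") and refer to columns by name instead of by position.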
So, as a made-up example, instead of:
de.m.v Florence 32 9024
de.v Florence 920 7570
we'd end up with:
de.wikivoyage.org Florence 920 32
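
The made-up example above can be sketched in Python as follows. This is not the script mentioned in this message, only an illustration under stated assumptions: the suffix table contains just the one mapping the example implies ("v" to wikivoyage.org) and would need checking against the real notation, an "m" component is taken to mark a mobile request, and zero requests are left out because their notation is not shown here. The names PROJECT_SUFFIXES, parse_project and merge_legacy are hypothetical.

```python
from collections import defaultdict

# Illustrative only: this single entry follows the example in the message.
# The real notation table must be verified before use.
PROJECT_SUFFIXES = {"v": "wikivoyage.org"}


def parse_project(notation):
    """Split e.g. 'de.m.v' into ('de.wikivoyage.org', True).

    Assumption: an 'm' between the language and the project suffix
    marks a mobile request. Notations without a suffix are not handled.
    """
    language, *middle, suffix = notation.split(".")
    return f"{language}.{PROJECT_SUFFIXES[suffix]}", "m" in middle


def merge_legacy(lines):
    """Combine legacy rows into (url, title, desktop, mobile) tuples."""
    counts = defaultdict(lambda: [0, 0])  # [desktop, mobile]
    for line in lines:
        # Legacy rows are space-separated; the byte count is dropped.
        notation, title, views, _bytes = line.split()
        project_url, is_mobile = parse_project(notation)
        counts[(project_url, title)][1 if is_mobile else 0] += int(views)
    return [
        (url, title, desktop, mobile)
        for (url, title), (desktop, mobile) in sorted(counts.items())
    ]


legacy = ["de.m.v Florence 32 9024", "de.v Florence 920 7570"]
for row in merge_legacy(legacy):
    print("\t".join(map(str, row)))
```

Keying the counts on the (project URL, title) pair is what puts desktop and mobile views for the same page on one output line.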
In the future we could also work to /normalise/ the title - replacing
it with the page title that refers to the actual pageID. This won't
impact legacy files, and is currently blocked on the Apps team, but
should be viable as soon as that blocker goes away.
I've written a script capable of parsing and reformatting the legacy
files, so we should be able to backfill in this new format too, if
that's wanted (see below).
*The size constraints*
There really aren't any. Like I said, the historical rationale for a
lot of these decisions seems to have been keeping the files small. But
by putting requests to the same title from different site versions on
the same line, and dropping byte-count, we save enough space that the
resulting files are approximately the same size as the old ones - or
in many cases, actually smaller.
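
To make the size argument concrete, the two legacy rows and the single merged row from the made-up Florence example can be compared byte for byte. This is only a sketch on one invented pair of rows, so it says nothing about real dump files; a page with desktop traffic only would likely grow slightly, since the project URL is longer than the legacy notation.

```python
# Rows copied from the made-up example earlier in this message.
legacy = "de.m.v Florence 32 9024\nde.v Florence 920 7570\n"
# Proposed format: tab-separated, desktop and mobile counts on one line.
proposed = "de.wikivoyage.org\tFlorence\t920\t32\n"

legacy_bytes = len(legacy.encode("utf-8"))
proposed_bytes = len(proposed.encode("utf-8"))

print(f"legacy: {legacy_bytes} bytes, proposed: {proposed_bytes} bytes")
```

For this pair the merged line is 34 bytes against 47 for the two legacy lines; the saving comes from dropping the byte counts and not repeating the title.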
*What I'm asking for*
Feedback! What do people think of the new format? What would they like
to see that they don't? What don't they need, here? How useful would
normalisation be? How useful would backfilling be?
*What I'm not asking for*
WMF time! Like I said, this is a spare-time project; I've also got
volunteers for Code Review and checking, too (Yuvi and Otto).
The replacement of the old files! Too many people depend on that
format and that definition, and I don't want to make them sad.
Thoughts?
-- 
Oliver Keyes
Research Analyst
Wikimedia Foundation

End of Analytics Digest, Vol 37, Issue 33