A big +1 to Erik. As he says, clients can switch over; let's deprecate
the old format and files. We can keep them around (and there is plenty
of code in various languages for _reading_ that format) but there's no
need to be restricted by it.
On 24 December 2015 at 09:41, Erik Zachte <ezachte(a)wikimedia.org> wrote:
Happy Holidays indeed, everyone!
Let's celebrate an eventful year with lots of progress on the Analytics
front. There are also open issues waiting to be addressed ASAP in the new
year. My personal priority is to get the geographical reports back up and
running, now that Dan implemented a new geo data feed using Hive data
earlier this month. Thanks again, Dan!
From: Analytics [mailto:analytics-bounces@lists.wikimedia.org] On Behalf Of
Dan Andreescu
Sent: Thursday, December 24, 2015 15:25
To: A mailing list for the Analytics Team at WMF and everybody who has an
interest in Wikipedia and analytics.
Subject: Re: [Analytics] [Pageviews] [Technical] Simplifying the available
static dumps of pageview data
Apologies! I realized it was Christmas Eve, but I by no means meant to rush
this conversation. Take as long as you like to reply to the thread, and
enjoy your holidays, everyone :) I'll poke the thread again after the New
Year. Happy Holidays!
On Thu, Dec 24, 2015 at 9:21 AM, Erik Zachte <ezachte(a)wikimedia.org> wrote:
Dan, thanks for raising the issue, a bit less for raising it on X-mas eve
;-) (just kidding, mostly).
Frankly, I don't see much use for the earlier releases at all. The newest
version has been kept very much downward compatible, so migration of clients
should be a no-brainer (mostly switching the download URL). Upgrading those
same clients to also use the new additional counts is a bit more work, as the
coding scheme is tedious (as a result of that downward compatibility). But
that upgrading could be done later.
I propose to deprecate both earlier sets, set an end date for updating
them (e.g. July 1), publish that widely, and offer support with
migration. If people feel otherwise, please chime in. Keeping the existing
files is another matter; we should of course do so.
About my aggregation datasets, it's just that: an aggregation of hourly
files into daily and monthly aggregates, with extreme compression while
retaining hourly precision, and adjusting for missing data (by
extrapolation). These files are ideal for batch processes and lean
downloads, and archiving for the longer haul.
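As a rough illustration of the aggregation idea described above, here is a
minimal Python sketch. It is not the actual pagecounts-ez code or file
format: the function name, the input tuples, and the letter-per-hour
encoding (A = hour 0 ... X = hour 23) are all assumptions made for the
example, and the extrapolation for missing data is not modeled.

```python
from collections import defaultdict


def aggregate_daily(hourly_rows):
    """Fold hourly view counts into one record per page for a single day.

    hourly_rows: iterable of (project, title, hour, count), hour in 0..23.
    Returns {(project, title): (daily_total, encoded_hours)}.
    The encoding is hypothetical: one letter per hour followed by its count,
    with zero-count hours omitted, so hourly precision is kept compactly.
    """
    per_page = defaultdict(dict)
    for project, title, hour, count in hourly_rows:
        hours = per_page[(project, title)]
        # Sum duplicates in case the same hour appears more than once.
        hours[hour] = hours.get(hour, 0) + count

    result = {}
    for key, hours in per_page.items():
        total = sum(hours.values())
        encoded = "".join(
            f"{chr(ord('A') + h)}{c}" for h, c in sorted(hours.items()) if c
        )
        result[key] = (total, encoded)
    return result


if __name__ == "__main__":
    rows = [
        ("en", "Main_Page", 0, 5),
        ("en", "Main_Page", 13, 7),
        ("de", "Berlin", 23, 1),
    ]
    for page, (total, encoded) in aggregate_daily(rows).items():
        print(page, total, encoded)
```

The same folding step could be repeated over daily records to build monthly
aggregates.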
Reworking the datasets, in whatever way, with categories as part of the
scheme sounds like a major overhaul, not like cleaning up old stuff.
Exciting, but best to be done under a separate flag.
Cheers,
Erik
From: Analytics [mailto:analytics-bounces@lists.wikimedia.org] On Behalf Of
Maurice Vergeer
Sent: Thursday, December 24, 2015 15:12
To: A mailing list for the Analytics Team at WMF and everybody who has an
interest in Wikipedia and analytics.
Subject: Re: [Analytics] [Pageviews] [Technical] Simplifying the available
static dumps of pageview data
Dear all,
As I just mentioned to Dan in a private email conversation, keeping datasets
even with imperfect measurements is important, particularly for longitudinal
analysis.
Also, from what I understand (being a newbie here), the data are
stored in separate files. Dan suggested reorganizing the page into categories.
Maybe another option is to create more extensive datasets with more
measurements in a single data file. On the other hand, the files
would become even bigger. That is not an issue for me, but for users in the
field accessibility (download bandwidth) could become an issue.
My two cents,
Maurice
On Thu, Dec 24, 2015 at 2:58 PM, Alex Druk <alex.druk(a)gmail.com> wrote:
Nothing against this approach!
On Thu, Dec 24, 2015 at 2:55 PM, Dan Andreescu <dandreescu(a)wikimedia.org>
wrote:
On Thu, Dec 24, 2015 at 8:48 AM, Alex Druk <alex.druk(a)gmail.com> wrote:
Hi Dan,
Happy holidays!
Good idea to combine these datasets! However, we have one more dataset, by
Erik Zachte:
http://dumps.wikimedia.org/other/pagecounts-ez/
And that's an important one! But I was thinking we could re-organize the
page into categories. Erik's dataset could go into a "processed data"
category or something like that. The three I wanted to talk about on this
thread are just the raw data.
_______________________________________________
Analytics mailing list
Analytics(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics
--
Thank you.
Alex Druk
alex.druk(a)gmail.com
(775) 237-8550 Google voice
--
________________________________________________
Maurice Vergeer
To contact me, see
http://mauricevergeer.nl/node/5
To see my publications, see
http://mauricevergeer.nl/node/1
________________________________________________