Apologies! I realize it is Christmas Eve, and I by no means meant to rush
this conversation. Take as long as you like to reply to the thread, and
enjoy your holidays, everyone :) I'll poke the thread again after the New
Year. Happy Holidays!
On Thu, Dec 24, 2015 at 9:21 AM, Erik Zachte <ezachte(a)wikimedia.org> wrote:
Dan, thanks for raising the issue (a bit less for raising it on X-mas eve
;-) (just kidding, mostly)
Frankly, I don't see much use for the earlier releases at all. The newest
version has been kept very much downward compatible, so migration of
clients should be a no-brainer (mostly switching the download URL).
Upgrading those same clients to also use the new additional counts is a bit
more work, as the coding scheme is tedious (as a result of that downward
compatibility). But that upgrade could be done later.
I propose to deprecate both earlier sets, set an end date for updating
them (e.g. July 1), publish that widely, and offer support with migration.
If people feel otherwise, please chime in. Keeping the existing files is
another matter; we should of course do so.
About my aggregation datasets, it's just that: an aggregation of hourly
files into daily and monthly aggregates, with extreme compression while
retaining hourly precision, and adjusting for missing data (by
extrapolation). These files are ideal for batch processes and lean
downloads, and archiving for the longer haul.
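As a rough illustration of the idea (hourly counts rolled up into daily
totals, with missing hours filled in by extrapolation), here is a minimal
Python sketch. The function name, the input layout, and the proportional
scaling rule are assumptions made for this example only; the actual
pagecounts-ez files use their own encoding and their own adjustment
method, which may well differ.

```python
from collections import defaultdict

HOURS_PER_DAY = 24


def daily_totals(hourly_counts):
    """Aggregate {(day, hour): views} into {day: estimated daily views}.

    Hypothetical helper: days with missing hours are scaled up
    proportionally, which is one simple form of extrapolation.
    """
    # Group the observed hourly values by day.
    observed = defaultdict(dict)
    for (day, hour), views in hourly_counts.items():
        observed[day][hour] = views

    totals = {}
    for day, hours in observed.items():
        seen = sum(hours.values())
        # Assumption: missing hours look like the average observed hour.
        totals[day] = round(seen * HOURS_PER_DAY / len(hours))
    return totals
```

A day with all 24 hours present is simply summed; a day with only some
hours present is scaled by 24 divided by the number of hours observed.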
Reworking the datasets, in whatever way, with categories as part of the
scheme sounds like a major overhaul, not like cleaning up old stuff.
Exciting, but best to be done under a separate flag.
Cheers,
Erik
*From:* Analytics [mailto:analytics-bounces@lists.wikimedia.org] *On
Behalf Of *Maurice Vergeer
*Sent:* Thursday, December 24, 2015 15:12
*To:* A mailing list for the Analytics Team at WMF and everybody who has
an interest in Wikipedia and analytics.
*Subject:* Re: [Analytics] [Pageviews] [Technical] Simplifying the
available static dumps of pageview data
Dear all,
As I just mentioned to Dan in a private email conversation, keeping
datasets even with imperfect measurements is important, particularly for
longitudinal analysis.
Also, from what I understand - me being a newbie here - the data are
stored in separate files. Dan suggested reorganizing the page into
categories. Maybe another option is to create more extensive datasets that
combine several different measurements in a single data file. On the other
hand, the files would become even bigger. Not an issue for me, but for
users in the field accessibility (download bandwidth) could become an issue.
my two cents
Maurice
On Thu, Dec 24, 2015 at 2:58 PM, Alex Druk <alex.druk(a)gmail.com> wrote:
Nothing against this approach!
On Thu, Dec 24, 2015 at 2:55 PM, Dan Andreescu <dandreescu(a)wikimedia.org>
wrote:
On Thu, Dec 24, 2015 at 8:48 AM, Alex Druk <alex.druk(a)gmail.com> wrote:
Hi Dan,
Happy holidays!
Good idea to combine these datasets! However, we have one more dataset, by
Erik Zachte:
http://dumps.wikimedia.org/other/pagecounts-ez/
And that's an important one! But I was thinking we could re-organize the
page into categories. Erik's dataset could go into a "processed data"
category or something like that. The three I wanted to talk about on this
thread are just the raw data.
_______________________________________________
Analytics mailing list
Analytics(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics
--
Thank you.
Alex Druk
alex.druk(a)gmail.com
(775) 237-8550 Google voice
--
________________________________________________
Maurice Vergeer
To contact me, see
http://mauricevergeer.nl/node/5
To see my publications, see
http://mauricevergeer.nl/node/1
________________________________________________