Re: [Analytics] [Pageviews] [Technical] Simplifying the available static dumps of pageview data

7 Jan 2016

Erik's proposal sounds very reasonable.

There might be some confusion about what we mean by "keeping the old
datasets for longitudinal analysis". No one is planning to remove the old
static dumps, only to stop generating and maintaining them going forward.

I also want to echo Nuria regarding the human cost of maintaining multiple
definitions. I just finished preparing a response to a reporter who was
asking about project-level mobile PV data, and I was not immediately able to
tell whether a specific data source I wanted to cite was using the old or new
definition (until I talked to Dan and we looked up a Gerrit patch
together).

How do people feel about turning off the generation of old dumps by *May
2016*, i.e. one year after having the two series of data available in
parallel?

On Wed, Jan 6, 2016 at 10:17 AM, Nuria Ruiz <nuria(a)wikimedia.org> wrote:

...
> As I just mentioned to Dan in a private email conversation, keeping
> datasets even with imperfect measurements is important. Particularly for
> longitudinal analysis.
 Keep in mind that maintaining these old dumps is not "free": having several
 pageview definitions around causes a lot of confusion and maintenance cost.
 We get a lot of questions about the spikiness of the old definition, and we
 need to maintain the software that generates the old files. Thus, we think it
 is reasonable to ask our users to transition to the new definition and
 eventually (in a period of months) turn off the old dumps.

 On Thu, Dec 24, 2015 at 6:12 AM, Maurice Vergeer <m.vergeer(a)maw.ru.nl>
 wrote:

  Dear all,

 As I just mentioned to Dan in a private email conversation, keeping
 datasets even with imperfect measurements is important. Particularly for
 longitudinal analysis.

 Also, from what I understand (being a newbie here), the data
 are stored in separate files. Dan suggested reordering the page into
 categories. Maybe another option is to create more extensive datasets with
 more measurements in a single data file. On the other hand, the
 files would become even bigger. Not an issue for me, but for users
 in the field accessibility (download bandwidth) could become an issue.

 my two cents
 Maurice

 On Thu, Dec 24, 2015 at 2:58 PM, Alex Druk <alex.druk(a)gmail.com> wrote:

  Nothing against this approach!

 On Thu, Dec 24, 2015 at 2:55 PM, Dan Andreescu <dandreescu(a)wikimedia.org>
 wrote:

 On Thu, Dec 24, 2015 at 8:48 AM, Alex Druk <alex.druk(a)gmail.com> wrote:

> Hi Dan,
> Happy holidays!
> Good idea to combine these datasets! However we have one more dataset
> by Erik Zachte : http://dumps.wikimedia.org/other/pagecounts-ez/
>

 And that's an important one!  But I was thinking we could re-organize
 the page into categories.  Erik's dataset could go into a "processed data"
 category or something like that.  The three I wanted to talk about on this
 thread are just the raw data.

 _______________________________________________
 Analytics mailing list
 Analytics(a)lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/analytics

 --
 Thank you.

 Alex Druk
 alex.druk(a)gmail.com
 (775) 237-8550 Google voice

 --
 ________________________________________________
 Maurice Vergeer
 To contact me, see http://mauricevergeer.nl/node/5
 To see my publications, see http://mauricevergeer.nl/node/1
 ________________________________________________

-- 

*Dario Taraborelli* Head of Research, Wikimedia Foundation
wikimediafoundation.org • nitens.org • @readermeter
<http://twitter.com/readermeter>
