I should have started this discussion a while ago, but it's easier to catch up on work during vacation :)
We currently have 3 available static file dumps of pageview data. I will explain them here and explain my thoughts on simplifying the situation. Feel free to turn this thread into a wiki.
* PAGECOUNTS-RAW http://dumps.wikimedia.org/other/pagecounts-raw/. We have this data going back to 2007. This is using a very simple pageview definition which incorrectly counts things like banner views as pageviews (for example).
* PAGECOUNTS-ALL-SITES http://dumps.wikimedia.org/other/pagecounts-all-sites/. We have this data starting in late 2014. Compared to PAGECOUNTS-RAW, this dataset also adds traffic from the mobile versions of our sites. But it's still using the same simple pageview definition.
* PAGEVIEWS http://dumps.wikimedia.org/other/pageviews/. We have this data starting in May 2015. It implements the new and much improved pageview definition https://meta.wikimedia.org/wiki/Research:Page_view that we now use. This is the same pageview definition used in the pageview API. This dataset also removes spider traffic and any automata traffic that we can detect.
All three datasets are in the same format (Domasz's archive format).
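For readers unfamiliar with the shared format, here is a minimal sketch of reading one line of an hourly file. It assumes each line holds four space-separated fields (project code, page title, view count, response size in bytes); the thread itself does not spell out the layout, so treat the field order and the names `PageCount` and `parse_line` as assumptions made for illustration.

```python
from typing import NamedTuple


class PageCount(NamedTuple):
    """One row of an hourly dump file (field names are hypothetical)."""
    project: str
    title: str
    views: int
    bytes_sent: int


def parse_line(line: str) -> PageCount:
    """Parse a line assumed to look like '<project> <title> <views> <bytes>'."""
    # Split the project code off the front and the two numbers off the back,
    # so an unexpected space inside the title does not shift the fields.
    project, rest = line.rstrip("\n").split(" ", 1)
    title, views, bytes_sent = rest.rsplit(" ", 2)
    return PageCount(project, title, int(views), int(bytes_sent))


row = parse_line("en Main_Page 42 123456\n")
print(row.project, row.title, row.views)
```

Because the three datasets share one layout, a reader like this would only need a different download URL to switch between them.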
So, before we can simplify this confusing situation, we need your help and input about what to keep and how to keep it. Here's the approach I would take:
Combine pagecounts-raw with pagecounts-all-sites into a new dataset called "pagecounts". Keep producing data to this dataset forever, but remove "pagecounts-raw" and "pagecounts-all-sites". This way, we can compare new data with historical data going back as far as we need. We would explain on dumps.wikimedia.org/other that this dataset gains mobile data starting in October 2014, to explain the relative local spike that happens there. This dataset would remain a pretty bad estimate of actual page views, and would remain sensitive to automata and spider spikes. But in combination with the "pageviews" dataset, I think it would be useful.
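The proposed merge can be sketched as follows, assuming each dataset has already been reduced to a mapping from hour to view count. The cutover date is a placeholder: the thread only says "October 2014", and the function name `combined_series` is hypothetical.

```python
from datetime import datetime

# Placeholder cutover: first hour taken from pagecounts-all-sites.
# The thread gives only "October 2014", so the exact day is an assumption.
CUTOVER = datetime(2014, 10, 1)


def combined_series(raw, all_sites, cutover=CUTOVER):
    """Merge two {hour: views} dicts into a single 'pagecounts' series.

    Hours before the cutover come from pagecounts-raw (desktop only);
    hours from the cutover onward come from pagecounts-all-sites, which
    is where the mobile traffic, and the visible jump, enters the series.
    """
    merged = {hour: n for hour, n in raw.items() if hour < cutover}
    merged.update({hour: n for hour, n in all_sites.items() if hour >= cutover})
    return dict(sorted(merged.items()))


raw = {datetime(2014, 9, 30, 23): 10, datetime(2014, 10, 1, 0): 11}
all_sites = {datetime(2014, 10, 1, 0): 15}
print(combined_series(raw, all_sites))
```

Keeping the cutover as an explicit parameter makes the discontinuity easy to document alongside the data.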
What do you all think? Sound off in this thread, and if we have consensus I'll start the cleanup.
Hi Dan, Happy holidays! Good idea to combine these datasets! However we have one more dataset by Erik Zachte : http://dumps.wikimedia.org/other/pagecounts-ez/
Analytics mailing list Analytics@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/analytics
On Thu, Dec 24, 2015 at 8:48 AM, Alex Druk alex.druk@gmail.com wrote:
Hi Dan, Happy holidays! Good idea to combine these datasets! However we have one more dataset by Erik Zachte : http://dumps.wikimedia.org/other/pagecounts-ez/
And that's an important one! But I was thinking we could re-organize the page into categories. Erik's dataset could go into a "processed data" category or something like that. The three I wanted to talk about on this thread are just the raw data.
Nothing against this approach!
Dear all,
As I just mentioned to Dan in a private email conversation, keeping datasets even with imperfect measurements is important. Particularly for longitudinal analysis.
Also, from what I understand (being a newbie here), the data are stored in separate files. Dan suggested reorganizing the page into categories. Another option might be to create more extensive datasets with more measurements in a single data file. On the other hand, the files would become even bigger. That is not an issue for me, but for users in the field accessibility (download bandwidth) could become an issue.
My two cents,
Maurice
Dan, thanks for raising the issue (a bit less for raising it on X-mas eve ;-) (just kidding, mostly)
Frankly I don't see much use for the earlier releases at all. The newest version has been kept very much downward compatible, so migration of clients should be a no-brainer (mostly switching the download URL). Upgrading those same clients to also use the new additional counts is a bit more work, as the coding scheme is tedious (a result of that downward compatibility). But that upgrading could be done later.
I propose to deprecate both earlier sets, and set an end date for updating those, e.g. July 1, and publish that widely, and offer support with migration. If people feel otherwise please chime in. Keeping the existing files is another matter, we should do so of course.
About my aggregation datasets, it's just that: an aggregation of hourly files into daily and monthly aggregates, with extreme compression while retaining hourly precision, and adjusting for missing data (by extrapolation). These files are ideal for batch processes and lean downloads, and archiving for the longer haul.
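As an illustration of the "adjusting for missing data (by extrapolation)" step, here is a minimal sketch that scales the observed hours of a day up to a full 24 hours. The thread does not describe the actual method used for pagecounts-ez, so this linear scaling and the function name `daily_total` are assumptions, not a description of Erik's code.

```python
def daily_total(hourly_counts):
    """Estimate a daily total from up to 24 hourly counts.

    hourly_counts: list of 24 entries, with None where an hourly file
    is missing. Missing hours are filled by assuming they look like the
    average of the hours that were observed (a simplifying assumption).
    """
    observed = [c for c in hourly_counts if c is not None]
    if not observed:
        return None  # nothing to extrapolate from
    return round(sum(observed) * 24 / len(observed))


# Half the hourly files are missing, so the observed sum is doubled.
print(daily_total([10] * 12 + [None] * 12))
```

A real aggregation would likely want to record which days were extrapolated, so that downstream users can tell estimates from complete counts.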
Reworking the datasets, in whatever way, with categories as part of the scheme sounds like a major overhaul, not like cleaning up old stuff. Exciting, but best to be done under a separate flag.
Cheers,
Erik
Apologies! I realized it was Christmas Eve but I by no means meant to rush this conversation. Take as long as you like to answer to the thread and enjoy your holidays everyone :) I'll poke the thread again after the New Year. Happy Holidays!
Happy Holidays indeed, everyone!
Let's celebrate an eventful year with lots of progress on the Analytics front. There are also open issues waiting to be addressed ASAP in the next year.
My personal priority is to get the geographical reports back up and running, now that Dan implemented a new geo data feed using Hive data earlier this month. Thanks again, Dan!
A big +1 to Erik. As he says, clients can switch over; let's deprecate the old format and files. We can keep them around (and there is plenty of code in various languages for _reading_ that format) but there's no need to be restricted by it.
On Thu, Dec 24, 2015 at 6:12 AM, Maurice Vergeer m.vergeer@maw.ru.nl wrote:
As I just mentioned to Dan in a private email conversation, keeping datasets even with imperfect measurements is important. Particularly for longitudinal analysis.
Keep in mind that maintaining these old dumps is not "free": having several pageview definitions around causes a lot of confusion and maintenance cost. We get a lot of questions about the spikiness of the old definition, and we need to maintain the software that generates the old files. Thus, we think it is reasonable to ask our users to transition to the new definition and eventually (over a period of months) turn off the old dumps.
Erik's proposal sounds very reasonable.
There might be some confusion about what we mean by "keeping the old datasets for longitudinal analysis". No one is planning to remove the old static dumps, just stop generating them/maintaining them going forward.
I also want to echo Nuria regarding the human cost of maintaining multiple definitions. I just finished preparing a response to a reporter who was asking about project-level mobile PV data, and I was not immediately able to tell whether a specific data source I wanted to cite was using the old or the new definition (until I talked to Dan and we looked up a Gerrit patch together).
How do people feel about turning off the generation of old dumps by *May 2016*, i.e. one year after having the two series of data available in parallel?
I almost revived this thread on Mardi Gras, but I didn't want to be known as The Holiday Crusher so I waited. Today is relatively safe [1] :)
Ok, there are three main points being made:
1. deprecating the old datasets
2. liberating ourselves from the old format
3. reorganizing the dumps page
My thoughts on each:
1. I agree with Dario and Erik's points. Let's keep the old files around, but stop generating new files in May 2016. To explain this, we'll make a new section called "Deprecated" and put links to the pagecounts-* datasets there.
2. I wasn't expecting to talk about format, but it makes sense because, for example, Erik's dataset is just a pivoted format. So, we could have a section for the Pageview datasets, with links for each format we already have: Domasz archive format, Erik Z compressed format. We could then add a new format that's easier to understand and could even include some of the data we expose via the pageview API. But from an organizational point of view, treating "format" as a separate concept from "dataset" will be an improvement.
3. I think it's time we had our own page instead of just being under dumps.wikimedia.org/other. Let's have dumps.wikimedia.org/analytics and link to it from both the main dumps page and /other. The separation will make it easier to reference other places where we publish static data dumps, like datasets.wikimedia.org. And it'll also make it easier to add links and references to how this work is being done and where people can interact with us or help us.
I hope I captured what everyone was saying. If there aren't any objections, I'll send a list of next steps needed to accomplish this, and get to work :)
[1] Today is Be Electrific Day, Get Out Your Guitar Day, Grandmother Achievement Day, National Don't Cry Over Spilled Milk Day, National Inventors' Day, National Make a Friend Day, National Peppermint Patty Day, National Shut-in Visitation Day, Pro Sports Wives Day, Promise Day, Satisfied Staying Single Day, White Shirt Day
On Wed, Jan 6, 2016 at 7:13 PM, Dario Taraborelli < dtaraborelli@wikimedia.org> wrote:
Erik's proposal sounds very reasonable.
There might be some confusion about what we mean by "keeping the old datasets for longitudinal analysis". No one is planning to remove the old static dumps, just stop generating them/maintaining them going forward.
I also want to echo Nuria regarding the human cost of maintaining multiple definitions. I just finished preparing a response to a reporter who was asking about project-level mobile PV data and I was not immediately able to answer if a specific data source I wanted to cite was using the old or new definition (until I talked to Dan and we looked up together a gerrit patch).
How do people feel about turning off the generation of old dumps by *May 2016*, i.e. one year after having the two series of data available in parallel?
On Wed, Jan 6, 2016 at 10:17 AM, Nuria Ruiz nuria@wikimedia.org wrote:
> As I just mentioned to Dan in a private email conversation, keeping datasets even with imperfect measurements is important. Particularly for longitudinal analysis.
Keep in mind that maintaining these old dumps is not "free": having several pageview definitions around causes a lot of confusion and maintenance cost. We get a lot of questions about the spikiness of the old definition, and we need to maintain the software that generates the old files. Thus, we think it is reasonable to ask our users to transition to the new definition and eventually (over a period of months) turn off the old dumps.
On Thu, Dec 24, 2015 at 6:12 AM, Maurice Vergeer m.vergeer@maw.ru.nl wrote:
Dear all,
As I just mentioned to Dan in a private email conversation, keeping datasets even with imperfect measurements is important. Particularly for longitudinal analysis.
Also, from what I understand (being a newbie here), the data are stored in separate files. Dan suggested reordering the page into categories. Maybe another option is to create more extensive datasets, with more measurements in a single data file. On the other hand, the files would become even bigger. Not an issue for me, but for users in the field, accessibility (download bandwidth) could become an issue.
My two cents,
Maurice
On Thu, Dec 24, 2015 at 2:58 PM, Alex Druk alex.druk@gmail.com wrote:
Nothing against this approach!
On Thu, Dec 24, 2015 at 2:55 PM, Dan Andreescu dandreescu@wikimedia.org wrote:
On Thu, Dec 24, 2015 at 8:48 AM, Alex Druk alex.druk@gmail.com wrote:
Hi Dan, Happy holidays! Good idea to combine these datasets! However we have one more dataset by Erik Zachte : http://dumps.wikimedia.org/other/pagecounts-ez/
And that's an important one! But I was thinking we could re-organize the page into categories. Erik's dataset could go into a "processed data" category or something like that. The three I wanted to talk about on this thread are just the raw data.
Analytics mailing list Analytics@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/analytics
-- Thank you.
Alex Druk alex.druk@gmail.com (775) 237-8550 Google voice
-- ________________________________________________ Maurice Vergeer To contact me, see http://mauricevergeer.nl/node/5 To see my publications, see http://mauricevergeer.nl/node/1 ________________________________________________
--
*Dario Taraborelli *Head of Research, Wikimedia Foundation wikimediafoundation.org • nitens.org • @readermeter http://twitter.com/readermeter
It's also the International Day of Women and Girls in Science!
Sounds like a good summary.
On 11 February 2016 at 07:31, Dan Andreescu dandreescu@wikimedia.org wrote:
I almost revived this thread on Mardi Gras, but I didn't want to be known as The Holiday Crusher so I waited. Today is relatively safe [1] :)
Ok, there are three main points being made:
- deprecating the old datasets
- liberating ourselves from the old format
- reorganizing the dumps page
> dumps.wikimedia.org/analytics
Does "analytics" mean anything in this context? Why not aim for something like dumps.wikimedia.org/views?
-Aaron
On Thu, Feb 11, 2016 at 9:39 AM, Oliver Keyes okeyes@wikimedia.org wrote:
It's also the International Day of Women and Girls in Science!
Sounds like a good summary.
-- Oliver Keyes Count Logula Wikimedia Foundation
or maybe dumps.wikimedia.org/traffic?
I hope someday we will (again) have edit stats similar to the views stats we now have (geo breakdown etc).
Erik
From: Analytics [mailto:analytics-bounces@lists.wikimedia.org] On Behalf Of Aaron Halfaker
Sent: Tuesday, February 16, 2016 18:11
To: A mailing list for the Analytics Team at WMF and everybody who has an interest in Wikipedia and analytics.
Subject: Re: [Analytics] [Pageviews] [Technical] Simplifying the available static dumps of pageview data
> dumps.wikimedia.org/analytics
Does "analytics" mean anything in this context? Why not aim for something like dumps.wikimedia.org/views?
-Aaron
My thought was that /analytics could grow into having edit stats, other research datasets, and whatever we see fit to add there. Over my years here, this website has remained the main place people come to get bulk data from us, so why keep trying to pull them into other places? We can just bring the data to them :)
I have been accused of over-generalizing, but in this case /analytics is a clear, simple improvement over /other, and we already have more than just traffic or pageview data that we can link to and explain from there.
And if the concern is that the page will get too complicated once we start adding all kinds of data, then I'd say that's a challenge we can deal with when it happens, but looking forward I think forking into /analytics/traffic and /analytics/edits would be a reasonable solution, and compatible with this first step.
On Tue, Feb 16, 2016 at 12:18 PM, Erik Zachte ezachte@wikimedia.org wrote:
or maybe dumps.wikimedia.org/traffic?
I hope someday we will (again) have edit stats similar to the views stats we now have (geo breakdown etc).
Erik