It's also the International Day of Women and Girls in Science!
Sounds like a good summary.
On 11 February 2016 at 07:31, Dan Andreescu dandreescu@wikimedia.org wrote:
I almost revived this thread on Mardi Gras, but I didn't want to be known as The Holiday Crusher so I waited. Today is relatively safe [1] :)
Ok, there are three main points being made:
- deprecating the old datasets
- liberating ourselves from the old format
- reorganizing the dumps page
My thoughts on each:
- I agree with Dario and Erik's points. Let's keep the old files around, but stop generating new files in May 2016. To explain this, we'll make a new section called "Deprecated" and put links to the pagecounts-* datasets there.
- I wasn't expecting to talk about format, but it makes sense because, for example, Erik's dataset is just a pivoted format. So, we could have a section for the Pageview datasets, with links for each format we already have: Domasz archive format, Erik Z compressed format. We could then add a new format that's easier to understand and could even include some of the data we expose via the pageview API. But from an organizational point of view, treating "format" as a separate concept from "dataset" will be an improvement.
- I think it's time we had our own page instead of just being under dumps.wikimedia.org/other. Let's have dumps.wikimedia.org/analytics and link to it from both the main dumps page and /other. The separation will make it easier to reference other places where we have static data file dumps, like datasets.wikimedia.org. And it'll also make it easier to add links and references to how this work is being done and where people can interact with us or help us.
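To illustrate what "pivoted" means in the format point above, here is a minimal sketch: hourly records (one row per page per hour) are regrouped into one entry per page, with the hourly counts side by side. The field layout and sample values are hypothetical and are not the actual layout of the pagecounts-* or pagecounts-ez files.

```python
from collections import defaultdict

# Hypothetical hourly records: (project, page title, hour of day, view count).
# These values are made up for illustration only.
hourly_rows = [
    ("en", "Main_Page", 0, 120),
    ("en", "Main_Page", 1, 95),
    ("en", "Science", 0, 7),
    ("de", "Hauptseite", 1, 40),
]


def pivot_by_page(rows):
    """Group hourly rows into one entry per (project, title) with per-hour counts."""
    pivoted = defaultdict(dict)
    for project, title, hour, views in rows:
        counts = pivoted[(project, title)]
        # Sum views in case the same page and hour appear more than once.
        counts[hour] = counts.get(hour, 0) + views
    return dict(pivoted)


pivoted = pivot_by_page(hourly_rows)
for (project, title), counts in sorted(pivoted.items()):
    print(project, title, sum(counts.values()), counts)
```

The same underlying counts are present in both shapes, which is why it can help to treat "format" separately from "dataset".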
I hope I captured what everyone was saying. If there aren't any objections, I'll send a list of next steps needed to accomplish this, and get to work :)
[1] Today is Be Electrific Day, Get Out Your Guitar Day, Grandmother Achievement Day, National Don't Cry Over Spilled Milk Day, National Inventors' Day, National Make a Friend Day, National Peppermint Patty Day, National Shut-in Visitation Day, Pro Sports Wives Day, Promise Day, Satisfied Staying Single Day, White Shirt Day
On Wed, Jan 6, 2016 at 7:13 PM, Dario Taraborelli dtaraborelli@wikimedia.org wrote:
Erik's proposal sounds very reasonable.
There might be some confusion about what we mean by "keeping the old datasets for longitudinal analysis". No one is planning to remove the old static dumps, just to stop generating and maintaining them going forward.
I also want to echo Nuria regarding the human cost of maintaining multiple definitions. I just finished preparing a response to a reporter who was asking about project-level mobile PV data, and I was not immediately able to say whether a specific data source I wanted to cite was using the old or new definition (until I talked to Dan and we looked up a Gerrit patch together).
How do people feel about turning off the generation of old dumps by May 2016, i.e. one year after having the two series of data available in parallel?
On Wed, Jan 6, 2016 at 10:17 AM, Nuria Ruiz nuria@wikimedia.org wrote:
> As I just mentioned to Dan in a private email conversation, keeping datasets even with imperfect measurements is important. Particularly for longitudinal analysis.
Keep in mind that maintaining these old dumps is not "free": having several pageview definitions around causes a lot of confusion and maintenance costs. We get a lot of questions about the spikiness of the old definition, and we need to maintain the software that generates the old files. Thus, we think it is reasonable to ask our users to transition to the new definition and eventually (in a period of months) turn off the old dumps.
On Thu, Dec 24, 2015 at 6:12 AM, Maurice Vergeer m.vergeer@maw.ru.nl wrote:
Dear all,
As I just mentioned to Dan in a private email conversation, keeping datasets even with imperfect measurements is important. Particularly for longitudinal analysis.
Also, from what I understand (being a newbie here), the data are stored in separate files. Dan suggested reordering the page into categories. Maybe another option is to create more extensive datasets with more measurements in a single data file. On the other hand, the files would become even bigger. Not an issue for me, but for users in the field, accessibility (download bandwidth) could become an issue.
my two cents Maurice
On Thu, Dec 24, 2015 at 2:58 PM, Alex Druk alex.druk@gmail.com wrote:
Nothing against this approach!
On Thu, Dec 24, 2015 at 2:55 PM, Dan Andreescu dandreescu@wikimedia.org wrote:
On Thu, Dec 24, 2015 at 8:48 AM, Alex Druk alex.druk@gmail.com wrote:
> Hi Dan,
> Happy holidays!
> Good idea to combine these datasets! However, we have one more dataset by Erik Zachte: http://dumps.wikimedia.org/other/pagecounts-ez/
And that's an important one! But I was thinking we could re-organize the page into categories. Erik's dataset could go into a "processed data" category or something like that. The three I wanted to talk about on this thread are just the raw data.
Analytics mailing list Analytics@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/analytics
-- Thank you.
Alex Druk alex.druk@gmail.com (775) 237-8550 Google voice
-- ________________________________________________ Maurice Vergeer To contact me, see http://mauricevergeer.nl/node/5 To see my publications, see http://mauricevergeer.nl/node/1 ________________________________________________
--
Dario Taraborelli Head of Research, Wikimedia Foundation wikimediafoundation.org • nitens.org • @readermeter