Does "analytics" mean anything in this context? Why not aim for something
like
-Aaron
On Thu, Feb 11, 2016 at 9:39 AM, Oliver Keyes <okeyes(a)wikimedia.org> wrote:
It's also the International Day of Women and Girls in Science!
Sounds like a good summary.
On 11 February 2016 at 07:31, Dan Andreescu <dandreescu(a)wikimedia.org> wrote:
I almost revived this thread on Mardi Gras, but I didn't want to be known
as The Holiday Crusher, so I waited. Today is relatively safe [1] :)
Ok, there are three main points being made:
1. deprecating the old datasets
2. liberating ourselves from the old format
3. reorganizing the dumps page
My thoughts on each:
1. I agree with Dario and Erik's points. Let's keep the old files around,
but stop generating new files in May 2016. To explain this, we'll make a
new section called "Deprecated" and put links to the pagecounts-* datasets
there.
2. I wasn't expecting to talk about format, but it makes sense because, for
example, Erik's dataset is just a pivoted format. So, we could have a
section for the Pageview datasets, with links for each format we already
have: Domasz archive format, Erik Z compressed format. We could then add a
new format that's easier to understand and could even include some of the
data we expose via the pageview API. But from an organizational point of
view, treating "format" as a separate concept from "dataset" will be an
improvement.
3. I think it's time we had our own page instead of just being under
dumps.wikimedia.org/other. Let's have dumps.wikimedia.org/analytics and
link to it from both the main dumps page and /other. The separation will
make it easier to reference other places where we have static data file
dumps, like datasets.wikimedia.org. And it'll also make it easier to add
links and references to how this work is being done and where people can
interact with us or help us.
I hope I captured what everyone was saying. If there aren't any objections,
I'll send a list of next steps needed to accomplish this, and get to
work :)
[1] Today is Be Electrific Day, Get Out Your Guitar Day, Grandmother
Achievement Day, National Don't Cry Over Spilled Milk Day, National
Inventors' Day, National Make a Friend Day, National Peppermint Patty Day,
National Shut-in Visitation Day, Pro Sports Wives Day, Promise Day,
Satisfied Staying Single Day, White Shirt Day
On Wed, Jan 6, 2016 at 7:13 PM, Dario Taraborelli
<dtaraborelli(a)wikimedia.org> wrote:
>
> Erik's proposal sounds very reasonable.
>
> There might be some confusion about what we mean by "keeping the old
> datasets for longitudinal analysis". No one is planning to remove the old
> static dumps, just stop generating them/maintaining them going forward.
>
> I also want to echo Nuria regarding the human cost of maintaining multiple
> definitions. I just finished preparing a response to a reporter who was
> asking about project-level mobile PV data and I was not immediately able to
> answer if a specific data source I wanted to cite was using the old or new
> definition (until I talked to Dan and we looked up a gerrit patch
> together).
>
> How do people feel about turning off the generation of old dumps by May
> 2016, i.e. one year after having the two series of data available in
> parallel?
>
>
>
> On Wed, Jan 6, 2016 at 10:17 AM, Nuria Ruiz <nuria(a)wikimedia.org> wrote:
>>
>> > As I just mentioned to Dan in a private email conversation, keeping
>> > datasets even with imperfect measurements is important. Particularly for
>> > longitudinal analysis.
>> Keep in mind that maintaining these old dumps is not "free": having
>> several pageview definitions around causes a lot of confusion and
>> maintenance costs. We get a lot of questions about the spikiness of the
>> old definition, and we need to maintain the software that generates the
>> old files. Thus, we think it is reasonable to ask our users to transition
>> to the new definition and eventually (in a period of months) turn off the
>> old dumps.
>>
>> On Thu, Dec 24, 2015 at 6:12 AM, Maurice Vergeer <m.vergeer(a)maw.ru.nl>
>> wrote:
>>>
>>> Dear all,
>>>
>>> As I just mentioned to Dan in a private email conversation, keeping
>>> datasets even with imperfect measurements is important. Particularly for
>>> longitudinal analysis.
>>>
>>> Also, from what I understand (me being a newbie here), the data are
>>> stored in separate files. Dan suggested reordering the page into
>>> categories. Maybe another option is to create more extensive datasets
>>> with more different measurements in a single datafile. On the other hand,
>>> the files would become even bigger in size. Not an issue for me, but for
>>> users in the field, accessibility (download bandwidth) could become an
>>> issue.
>>>
>>> my two cents
>>> Maurice
>>>
>>>
>>> On Thu, Dec 24, 2015 at 2:58 PM, Alex Druk <alex.druk(a)gmail.com> wrote:
>>>>
>>>> Nothing against this approach!
>>>>
>>>> On Thu, Dec 24, 2015 at 2:55 PM, Dan Andreescu
>>>> <dandreescu(a)wikimedia.org> wrote:
>>>>>
>>>>>
>>>>>
>>>>> On Thu, Dec 24, 2015 at 8:48 AM, Alex Druk <alex.druk(a)gmail.com> wrote:
>>>>>>
>>>>>> Hi Dan,
>>>>>> Happy holidays!
>>>>>> Good idea to combine these datasets! However we have one more dataset
>>>>>> by Erik Zachte: http://dumps.wikimedia.org/other/pagecounts-ez/
>>>>>
>>>>>
>>>>> And that's an important one! But I was thinking we could re-organize
>>>>> the page into categories. Erik's dataset could go into a "processed
>>>>> data" category or something like that. The three I wanted to talk about
>>>>> on this thread are just the raw data.
>>>
>>> _______________________________________________
>>> Analytics mailing list
>>> Analytics(a)lists.wikimedia.org
>>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>>
>>
>>
>>
>> --
>> Thank you.
>>
>> Alex Druk
>> alex.druk(a)gmail.com
>> (775) 237-8550 Google voice
>>
>
>
>
> --
> ________________________________________________
> Maurice Vergeer
> To contact me, see http://mauricevergeer.nl/node/5
> To see my publications, see http://mauricevergeer.nl/node/1
> ________________________________________________
>
--
Dario Taraborelli Head of Research, Wikimedia Foundation
wikimediafoundation.org • nitens.org • @readermeter
--
Oliver Keyes
Count Logula
Wikimedia Foundation