Hi Dan,
Making dumps much easier to use would definitely help. We Wikipedia
researchers are kind of spoiled: we have easy public access to historical
revision data for all projects, going back to 2001, through the API *and*
public db endpoints like Quarry. It's only natural that we want the same
thing with pageviews!!! :)
I can think of other use-cases for keeping more than 18 months of data
available through the API, but they're all research use cases. I don't
think having lower-granularity historical data available beyond a certain
point is helpful for those--if you're doing historical analysis, you want
consistency. But an application that parsed dumps on the server side to
yield historical data (ideally in a format and granularity that wasn't
fundamentally different from that of the API, so you could join the
streams) would definitely be useful, and would probably address most
research needs I can think of, inside and outside the Foundation.
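To make the "joinable streams" idea concrete, here is a rough sketch of what such a client could look like: it routes recent dates to the live Pageview API's per-article endpoint and anything older than the retention window to a dump-backed service. The dump service URL and route shape are pure assumptions for illustration; only the Pageview API endpoint is real.

```python
# Sketch only: route a per-article daily request to the Pageview API
# for recent dates, or to a hypothetical dump-backed service (same
# record shape, so the two streams can be joined) for older dates.
from datetime import date, timedelta

API_ROOT = "https://wikimedia.org/api/rest_v1/metrics/pageviews"
DUMP_ROOT = "https://dumps.example.org/pageviews"  # assumed, not a real service
RETENTION_DAYS = 18 * 30  # ~18 months of API retention

def pageview_url(project, article, day, today=None):
    """Return the URL for one article's daily pageviews on one day,
    choosing API vs. dumps based on the retention window."""
    today = today or date.today()
    stamp = day.strftime("%Y%m%d")
    if today - day <= timedelta(days=RETENTION_DAYS):
        # Real per-article endpoint of the Pageview API.
        return (f"{API_ROOT}/per-article/{project}/all-access/all-agents/"
                f"{article}/daily/{stamp}/{stamp}")
    # Older than retention: fall back to the assumed dump-backed service.
    return f"{DUMP_ROOT}/{project}/{article}/daily/{stamp}"
```

The point is that callers never decide which backend they hit; the client does, which is what would make the dumps feel as easy as the API.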
Thanks for asking,
Jonathan
On Fri, Jul 29, 2016 at 12:27 PM, Dan Andreescu <dandreescu(a)wikimedia.org> wrote:
Amir and Jonathan - thanks for speaking up for
the "more than 18 months"
use cases. If dumps were *much* easier to use (via Python clients that
made it transparent whether you were hitting the API or not), would that be
an acceptable solution? I feel like both of your use cases are not things
that will be happening on a daily basis. If that's true, another solution
would be an ad-hoc API that took in a filter and a date range, applied it
server-side, and gave you a partial dump with only the interesting data.
If this didn't happen very often, it would allow us to trade processing
time and a bit of dev time for more expensive storage.
Or, if we end up needing frequent access to old data, we should be able
to justify spending more money on more servers. Just trying to save as
much money as possible :)
Thanks all so far, please feel free to keep chiming in if you have other
use cases that haven't been covered, or if you'd like to add more weight
behind the "more than 18 months" use cases.
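The ad-hoc "partial dump" idea above might look something like this server-side sketch. The row layout (day, project, title, views) is an assumption, loosely modeled on the pageview dump files; the function name and filter parameters are illustrative, not an existing API.

```python
# Sketch only: apply a filter and a date range server-side and stream
# back just the matching dump rows, trading processing time for storage.
from datetime import date

def partial_dump(rows, projects, titles, start, end):
    """Yield only rows matching the requested projects, titles,
    and inclusive [start, end] date range. Empty filters match all."""
    for day, project, title, views in rows:
        if not (start <= day <= end):
            continue
        if projects and project not in projects:
            continue
        if titles and title not in titles:
            continue
        yield (day, project, title, views)
```

Because this runs server-side over the raw dumps, the client only ever downloads the slice it asked for, which is the cost trade Dan describes.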
On Fri, Jul 29, 2016 at 3:18 PM, Leila Zia <leila(a)wikimedia.org> wrote:
Dan, Thanks for reaching out.
18 months is enough for my use cases as long as the dumps capture the
exact data structure.
Best,
Leila
--
Leila Zia
Senior Research Scientist
Wikimedia Foundation
On Fri, Jul 29, 2016 at 11:51 AM, Amir E. Aharoni <amir.aharoni(a)mail.huji.ac.il> wrote:
> I am now checking traffic data every day to see whether Compact
> Language Links affect it. It makes sense to compare them not only to the
> previous week, but also to the same month in the previous year. So one year is
> hardly enough. 18 months is better, and three years is much better, because
> I'll also be able to check the same month in earlier years.
>
> I imagine that this may be useful to all product managers who work on
> features that can affect traffic.
>
> On 29 July 2016 at 15:41, "Dan Andreescu" <dandreescu(a)wikimedia.org>
> wrote:
>
>> Dear Pageview API consumers,
>>
>> We would like to plan storage capacity for our pageview API cluster.
>> Right now, with a reliable RAID setup, we can keep *18 months* of
>> data. If you'd like to query further back than that, you can download dump
>> files (which we'll make easier to use with Python utilities).
>>
>> What do you think? Will you need more than 18 months of data? If
>> so, we need to add more nodes when we get to that point, and that costs
>> money, so we want to check if there is a real need for it.
>>
>> Another option is to start degrading the resolution for older data (for
>> example, only keeping weekly or monthly resolution for data older than 1 year). If
>> you need more than 18 months, we'd love to hear your use case and something
>> in the form of:
>>
>> need daily resolution for 1 year
>> need weekly resolution for 2 years
>> need monthly resolution for 3 years
>>
>> Thank you!
>>
>> Dan
>>
>> _______________________________________________
>> Analytics mailing list
>> Analytics(a)lists.wikimedia.org
>> https://lists.wikimedia.org/mailman/listinfo/analytics
--
Jonathan T. Morgan
Senior Design Researcher
Wikimedia Foundation
User:Jmorgan (WMF) <https://meta.wikimedia.org/wiki/User:Jmorgan_(WMF)>