[Survey] Pageview API


Dan Andreescu

12 Sep 2015

4 a.m.

Hi everyone. End of quarter is rapidly approaching and I wanted to ask a quick question about one of the endpoints we want to push out. We want to let you ask "what are the top articles" but we're not sure how to structure the URL so it's most useful to you. Here are the choices:

Choice 1. /top/{project}/{access}/{days-in-the-past}

Example: top articles via all en.wikipedia sites for the past 30 days: /top/en.wikipedia/all-access/30

Choice 2. /top/{project}/{access}/{start}/{end}

Example: top articles via all en.wikipedia sites from June 12th, 2014 to August 30th, 2015: /top/en.wikipedia/all-access/2014-06-12/2015-08-30

(in all of those,

* {project} means en.wikipedia, commons.wikimedia, etc.
* {access} means access method as in desktop, mobile web, mobile app

)

Which do you prefer? Would any other query style be useful?


Leila Zia

12 Sep

4:06 a.m.

It's getting exciting. :-)

I'd go with choice 2 since it gives more control to the user while offering what the user can get through choice 1 as well.
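To illustrate the point that choice 2 can express what choice 1 offers, here is a minimal Python sketch that turns a day count into a start/end pair and builds the choice 2 path. The function names are hypothetical, and treating "the past N days" as ending on the day of the request is an assumption; the thread does not settle whether the bounds are inclusive.

```python
from datetime import date, timedelta


def days_to_range(days_in_the_past: int, today: date) -> tuple[str, str]:
    """Translate a choice 1 style day count into choice 2 style dates.

    Assumption: the range ends on `today` and starts `days_in_the_past`
    days earlier; the thread does not define the exact boundaries.
    """
    if days_in_the_past < 1:
        raise ValueError("days_in_the_past must be at least 1")
    start = today - timedelta(days=days_in_the_past)
    return start.isoformat(), today.isoformat()


def top_range_path(project: str, access: str, start: str, end: str) -> str:
    """Build a choice 2 style path: /top/{project}/{access}/{start}/{end}."""
    return f"/top/{project}/{access}/{start}/{end}"


start, end = days_to_range(30, today=date(2015, 9, 12))
print(top_range_path("en.wikipedia", "all-access", start, end))
# prints /top/en.wikipedia/all-access/2015-08-13/2015-09-12
```

The reverse does not hold: a range such as 2014-06-12 to 2015-08-30 cannot be written as a single day count relative to today.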

Question: will we get page_ids or page_titles or both? It's good to have both.

Leila


Analytics mailing list Analytics@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/analytics

Gabriel Wicke

4:09 a.m.

The former might be slightly easier to cache, and can be linked to / pulled in statically, without a need to dynamically construct a URL. Would it be hard to offer both?


-- Gabriel Wicke Principal Engineer, Wikimedia Foundation

Dan Andreescu

4:14 a.m.

It wouldn't be too hard to offer both, but I'm thinking it might be confusing for a consumer. I think ultimately the decision should be up to the people using this data, because the use cases are fairly different for each form. If people ask for both, we'll do both.

Leila, we'd love to have page_ids as well, but we'd have to block the release on a bigger effort to reliably mirror MediaWiki databases in Hadoop for processing, so we'll probably punt on that for now. But we have many reasons to work on that sooner rather than later.


Adam Baso

4:27 a.m.

I'd be in favor of both. Maybe with a little tweak to the pathing:

/top/{project}/{access}/days/{days-in-the-past}

/top/{project}/{access}/range/{start}/{end}

with "days" or "range" maybe moved earlier in the forward-slash-separated path if this ordering doesn't read well semantically.
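A rough sketch of why the literal "days" and "range" segments help: a server (or client) can tell the two forms apart from the path alone, without guessing from the number of trailing segments. The regular expressions and the `parse_top_path` helper below are hypothetical, and the YYYY-MM-DD date format is assumed from the examples in the original message.

```python
import re

# Hypothetical patterns for the two path shapes proposed above.
DAYS_ROUTE = re.compile(
    r"^/top/(?P<project>[^/]+)/(?P<access>[^/]+)/days/(?P<days>\d+)$"
)
RANGE_ROUTE = re.compile(
    r"^/top/(?P<project>[^/]+)/(?P<access>[^/]+)/range/"
    r"(?P<start>\d{4}-\d{2}-\d{2})/(?P<end>\d{4}-\d{2}-\d{2})$"
)


def parse_top_path(path: str) -> dict:
    """Return the query kind and its parameters, or raise ValueError."""
    match = DAYS_ROUTE.match(path)
    if match:
        params = match.groupdict()
        return {
            "kind": "days",
            "project": params["project"],
            "access": params["access"],
            "days": int(params["days"]),
        }
    match = RANGE_ROUTE.match(path)
    if match:
        return {"kind": "range", **match.groupdict()}
    raise ValueError(f"unrecognised path: {path}")


print(parse_top_path("/top/en.wikipedia/all-access/days/30"))
print(parse_top_path("/top/en.wikipedia/all-access/range/2014-06-12/2015-08-30"))
```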


Marcel Ruiz Forns

4:38 a.m.

+1 Adam

Also, maybe *top-articles* instead of *top*, to avoid naming collision in the future?


-- *Marcel Ruiz Forns* Analytics Developer Wikimedia Foundation

Andrew Otto

14 Sep

8:18 p.m.

> Also, maybe top-articles instead of top, to avoid naming collision in the future?

+1 for prefixing whatever paths you are doing now with something relevant. I sense that there might be more than just pageview data in the future.

/pageviews/top/…?


Dan Andreescu

8:53 p.m.

Thank you all for your thoughtful opinions.

Since people want to know the top pages over an arbitrary time period, we think Druid would be the best back-end for that kind of query. But we're not going to push that for the first release. It's very useful to know that's the consensus; we can now start talking to Jaime Crespo about Druid / alternatives, make plans, etc. Until then, the first release is going to have the top endpoint that Joseph wrote about. This is easy to pre-aggregate and dump into Cassandra. Also, the /v1/pageviews/ prefix is going to be on all the endpoints we launch with, because these are endpoints in a "pageviews" RESTBase module. So we'll have:

/v1/pageviews/top/{project}/{access}/{year}/{month}/{day}

for now, with {month} and {day} being optional parameters. This will give you the top pageviews for the selected calendar date. And as soon as we can, we'll have:

/v1/pageviews/top/{project}/{access}/from/{start}{/end}

As proposed by Gabriel, with {start} and {end} taking both full dates and "now"-relative negative integers.
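As a sketch of how {start} and {end} could accept both full dates and "now"-relative negative integers, the Python below resolves each bound against the request date. Several things here are assumptions rather than anything the thread specifies: the helper names, the YYYY-MM-DD date format, the unit of the negative integers (days), and the fallback of a missing {end} to the current date.

```python
from datetime import date, timedelta
from typing import Optional


def resolve_bound(value: str, today: date) -> date:
    """Interpret a bound as a full date or a "now"-relative offset.

    Assumption: negative integers count days back from `today`,
    and full dates are written as YYYY-MM-DD.
    """
    if value.startswith("-") and value[1:].isdigit():
        return today + timedelta(days=int(value))
    return date.fromisoformat(value)


def resolve_range(start: str, end: Optional[str], today: date) -> tuple[date, date]:
    """Resolve both bounds; a missing end is assumed to mean `today`."""
    start_date = resolve_bound(start, today)
    end_date = resolve_bound(end, today) if end is not None else today
    if start_date > end_date:
        raise ValueError("start must not be after end")
    return start_date, end_date


today = date(2015, 9, 14)
print(resolve_range("-30", None, today))
print(resolve_range("2014-06-12", "2015-08-30", today))
```

Passing `today` in explicitly keeps the resolution testable and makes the dependence on the request time visible.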

The initial endpoint we launch won't have hourly resolution; that seems like too much data to pre-aggregate. But we'll see how Druid handles very specific dates (should be fine) and make that a feature in the second version. We'll have to look into the privacy implications of short time ranges, like an hour.
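For the first-release path given above, /v1/pageviews/top/{project}/{access}/{year}/{month}/{day} with {month} and {day} optional, a client-side builder might look like the sketch below. The function name is hypothetical, and the zero-padding of month and day and the rule that {day} requires {month} are assumptions; the message does not spell out either.

```python
from typing import Optional


def top_path(project: str, access: str, year: int,
             month: Optional[int] = None, day: Optional[int] = None) -> str:
    """Build the proposed first-release path with optional month and day."""
    if day is not None and month is None:
        # Assumption: a day only makes sense within a given month.
        raise ValueError("day requires month")
    parts = ["/v1/pageviews/top", project, access, f"{year:04d}"]
    if month is not None:
        parts.append(f"{month:02d}")  # zero-padding is an assumption
    if day is not None:
        parts.append(f"{day:02d}")
    return "/".join(parts)


print(top_path("en.wikipedia", "all-access", 2015))
print(top_path("en.wikipedia", "all-access", 2015, 9))
print(top_path("en.wikipedia", "all-access", 2015, 9, 14))
```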


Marko Obrovac

15 Sep

4:57 p.m.

Hello,

Gabriel, Dan and I are discussing this very same topic on T103811 [1,2,3], so please take a look there and weigh in!

As for the specific endpoints, perhaps it'd be worth switching the places of *top* and the project name to be more in line with the current public RESTful URI layout?

Also, I must admit I find the non-determinism of the endpoints confusing to some extent. Specifically I'm referring to the `/{start}/{end}` portion (or, in your notation, this should really be `/{start}{/end}` denoting that `{end}` is an optional URI parameter), the problem being exactly that `{end}` is optional and, if not supplied, the current date is assumed. That entails that the result of making a request to the endpoint without an end date (or timestamp) depends on the context (the context in this case being the time stamp of the request). So, one day the request encompasses a span of 24h, while the next that same request refers to a 48h period.

I do agree that this makes it easier for humans to issue requests ("Why would I need to write down today's date?"), but APIs are meant to be only *human-friendly*, not *for humans* (yes, there is a difference :P). What I mean is that it should feel natural for humans to create / programme calls to the API and then use these results in their applications/presentations/etc. In that context, there is literally no difference between:

- give me the list of top articles for the past 30 days (this is how the human asks the question)
- give me the list of top articles starting from 2015-08-15 (for an application, that's just a matter of computing `current_time() - 1m`)
- give me the list of top articles starting from 2015-08-15 and ending on 2015-09-15 (idem as above plus a call to `current_time()`)

Unless, of course, you target mostly human requests, in which case my argument is rendered moot :P
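The non-determinism described above can be made concrete with a small Python sketch: the same open-ended request covers a different span depending on the day it is issued, whereas a client that computes both dates itself always sends a fully specified URL. The function names are hypothetical, the path shape is choice 2 from the original proposal, and 30 days stands in for the `1m` used in the list above.

```python
from datetime import date, timedelta


def span_in_days(start: date, request_day: date) -> int:
    """Length of an open-ended range whose end defaults to the request day."""
    return (request_day - start).days


def explicit_range_path(project: str, access: str, days: int, today: date) -> str:
    """Compute both bounds client-side so the URL pins down the range."""
    start = today - timedelta(days=days)
    return f"/top/{project}/{access}/{start.isoformat()}/{today.isoformat()}"


start = date(2015, 9, 14)
# The same "from 2015-09-14, no end" request on two consecutive days:
print(span_in_days(start, date(2015, 9, 15)))  # 1 day (24h)
print(span_in_days(start, date(2015, 9, 16)))  # 2 days (48h)

print(explicit_range_path("en.wikipedia", "all-access", 30, date(2015, 9, 15)))
# prints /top/en.wikipedia/all-access/2015-08-16/2015-09-15
```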

My 2 cents, Marko

[1] https://phabricator.wikimedia.org/T103811
[2] https://phabricator.wikimedia.org/T103811#1639417
[3] https://phabricator.wikimedia.org/T103811#1640977


-- Marko Obrovac, PhD Senior Services Engineer Wikimedia Foundation

Andrew Otto

7:43 p.m.

+1 for just making the URI consistent and not supporting too many nice human edge cases :)

...

On Sep 15, 2015, at 06:57, Marko Obrovac mobrovac@wikimedia.org wrote:

Hello,

Gabriel, Dan and I are discussing this very same topic on T103811~[1,2,3], so please take a look there and weigh in!

As for the specific endpoints, perhaps it'd be worth switching the places of *top* and the project name to be more in line with the current public RESTful URI layout?

Also, I must admit I find the non-determinism of the endpoints confusing to some extent. Specifically I'm referring to the `/{start}/{end}` portion (or, in your notion, this should really be `/{start}{/end}` denoting that `{end}` is an optional URI parameter), the problem being exactly that `{end}` is optional and, if not supplied, the current date is assumed. That entails that the result of making a request to the endpoint without an end date (or TS) depends on the context (the context in this case being the time stamp of the request). So, one day the request encompasses a span of 24h, while the next that same request refers to a 48h period.

I do agree that this makes it easier for humans to issue requests ("Why would I need to write down today's date?"), but APIs are meant to be only *human-friendly*, not *for humans* (yes, there is a difference :P). What I mean is that it should feel natural for humans to create / programme calls to the API and then use these results in their applications/presentations/etc. In that context, there is literally no difference between:

give me the list of top articles for the past 30 days (this is how the human asks the question)

give me the list of top articles starting from 2015-08-15 (for an application, that's just a matter of computing `current_time() - 1m`)

give me the list of top articles starting form 2015-08-15 and ending on 2015-09-15 (idem as above plus a call to `current_time()`)

Unless, of course, you target mostly human requests, in which case my argument is rendered moot :P
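To illustrate the point about applications computing the dates themselves, here is a minimal Python sketch that turns the human question ("the past 30 days") into an explicit start and end date. The helper names `explicit_range` and `top_path` are hypothetical, and the path layout simply follows "Choice 2" from the original post; nothing here is the actual API.

```python
from datetime import date, timedelta


def explicit_range(days_in_the_past, today):
    """Return (start, end) as ISO dates covering the last N days up to `today`."""
    start = today - timedelta(days=days_in_the_past)
    return start.isoformat(), today.isoformat()


def top_path(project, access, start, end):
    # Hypothetical helper: path layout taken from "Choice 2" in the thread.
    return f"/top/{project}/{access}/{start}/{end}"


# `today` is passed in explicitly so the same call always yields the same URL.
start, end = explicit_range(30, today=date(2015, 9, 15))
print(top_path("en.wikipedia", "all-access", start, end))
```

Passing the reference date explicitly keeps the request deterministic, which is the property being argued for above.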

My 2 cents, Marko

[1] https://phabricator.wikimedia.org/T103811 [2] https://phabricator.wikimedia.org/T103811#1639417 [3] https://phabricator.wikimedia.org/T103811#1640977

On 14 September 2015 at 16:53, Dan Andreescu dandreescu@wikimedia.org wrote:

Thank you all for your thoughtful opinions.

Since people want to know the top pages over an arbitrary time period, we think Druid would be the best back-end for that kind of query. But we're not going to push that for the first release. It's very useful to know that's the consensus: we can now start talking to Jaime Crespo about Druid / alternatives, make plans, etc. Until then, the first release is going to have the top endpoint that Joseph wrote about, which is easy to pre-aggregate and dump into Cassandra. Also, the /v1/pageviews/ prefix is going to be on all the endpoints we launch with, because these are endpoints in a "pageviews" RESTBase module. So we'll have:

/v1/pageviews/top/{project}/{access}/{year}/{month}/{day}

for now, with {month} and {day} being optional parameters. This will give you the top pageviews for the selected calendar date. And as soon as we can, we'll have:

/v1/pageviews/top/{project}/{access}/from/{start}{/end}

As proposed by Gabriel, with {start} and {end} taking both full dates and "now"-relative negative integers.
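As a rough sketch of how `{start}` and `{end}` could accept both full dates and "now"-relative negative integers, the Python below resolves either form to a concrete date. It assumes the integers count days and that a missing `{end}` means "today"; the thread does not fix the unit, and the function names `resolve_bound` and `resolve_range` are made up for illustration.

```python
from datetime import date, timedelta


def resolve_bound(value, today):
    """Resolve a YYYY-MM-DD date or a 'now'-relative offset (assumed to be in days)."""
    try:
        offset = int(value)
    except ValueError:
        # Not an integer, so treat it as a full ISO date.
        return date.fromisoformat(value)
    if offset > 0:
        raise ValueError("relative offsets must be zero or negative")
    return today + timedelta(days=offset)


def resolve_range(start, end=None, *, today):
    """Return (start_date, end_date); a missing end is assumed to mean `today`."""
    end_date = today if end is None else resolve_bound(end, today)
    return resolve_bound(start, today), end_date


print(resolve_range("-30", today=date(2015, 9, 15)))
print(resolve_range("2015-08-15", "2015-09-15", today=date(2015, 9, 15)))
```

Note that the relative form gives a different answer every day, which is exactly the non-determinism raised earlier in the thread.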

The initial endpoint we launch won't have hourly resolution, that seems like too much data to pre-aggregate. But we'll see how Druid handles very specific dates (should be fine) and make that a feature in the second version. We'll have to look into the privacy implications of short time ranges, like an hour.
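For the first-release endpoint, where the finest granularity is a calendar day, a client could build paths along these lines. This is only a sketch: the function name `top_pageviews_path` is hypothetical, and zero-padding the month and day is an assumption, since the thread does not say how those segments are formatted.

```python
def top_pageviews_path(project, access, year, month=None, day=None):
    """Build /v1/pageviews/top/{project}/{access}/{year}/{month}/{day}.

    {month} and {day} are optional; a day without a month makes no sense.
    """
    if day is not None and month is None:
        raise ValueError("day requires month")
    parts = ["/v1/pageviews/top", project, access, f"{year:04d}"]
    if month is not None:
        parts.append(f"{month:02d}")  # zero-padding is an assumption
    if day is not None:
        parts.append(f"{day:02d}")  # zero-padding is an assumption
    return "/".join(parts)


print(top_pageviews_path("en.wikipedia", "all-access", 2015))
print(top_pageviews_path("en.wikipedia", "all-access", 2015, 9))
print(top_pageviews_path("en.wikipedia", "all-access", 2015, 9, 14))
```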

On Mon, Sep 14, 2015 at 10:18 AM, Andrew Otto aotto@wikimedia.org wrote:

...
Also, maybe top-articles instead of top, to avoid naming collision in the future?

+1 for prefixing whatever paths you are doing now with something relevant. I sense that there might be more than just pageview data in the future.

/pageviews/top/…?

...
On Sep 11, 2015, at 18:38, Marcel Ruiz Forns mforns@wikimedia.org wrote:

+1 Adam

Also, maybe top-articles instead of top, to avoid naming collision in the future?

On Sat, Sep 12, 2015 at 12:27 AM, Adam Baso abaso@wikimedia.org wrote:

I'd be in favor of both. Maybe with a little tweak to the pathing:

/top/{project}/{access}/days/{days-in-the-past}

/top/{project}/{access}/range/{start}/{end}

with "days" or "range" maybe being earlier in the forward slash separated spec if it doesn't read well semantically.

On Fri, Sep 11, 2015 at 3:14 PM, Dan Andreescu dandreescu@wikimedia.org wrote:

It wouldn't be too hard to offer both, but I'm thinking it might be confusing for a consumer. I think ultimately the decision should be up to the people using this data, because the use cases are fairly different for each form. If people ask for both, we'll do both.

Leila, we'd love to have page_ids as well, but we'd have to block the release on a bigger effort to reliably mirror MediaWiki databases in Hadoop for processing, so we'll probably punt on that for now. But we have many reasons to work on that sooner rather than later.

On Fri, Sep 11, 2015 at 6:09 PM, Gabriel Wicke gwicke@wikimedia.org wrote:

The former might be slightly easier to cache, and can be linked to / pulled in statically, without a need to dynamically construct a URL. Would it be hard to offer both?

On Fri, Sep 11, 2015 at 3:06 PM, Leila Zia leila@wikimedia.org wrote:

It's getting exciting. :-)

I'd go with choice 2 since it gives more control to the user while offering what the user can get through choice 1 as well.

Question: will we get page_ids or page_titles or both? It's good to have both.

Leila

On Fri, Sep 11, 2015 at 3:00 PM, Dan Andreescu dandreescu@wikimedia.org wrote:

Hi everyone. End of quarter is rapidly approaching and I wanted to ask a quick question about one of the endpoints we want to push out. We want to let you ask "what are the top articles" but we're not sure how to structure the URL so it's most useful to you. Here are the choices:

Choice 1. /top/{project}/{access}/{days-in-the-past}

Example: top articles via all en.wikipedia sites for the past 30 days: /top/en.wikipedia/all-access/30

Choice 2. /top/{project}/{access}/{start}/{end}

Example: top articles via all en.wikipedia sites from June 12th, 2014 to August 30th, 2015: /top/en.wikipedia/all-access/2014-06-12/2015-08-30

(in all of those,

{project} means en.wikipedia, commons.wikimedia, etc.

{access} means access method as in desktop, mobile web, mobile app

)

Which do you prefer? Would any other query style be useful?

Analytics mailing list Analytics@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/analytics

-- Gabriel Wicke Principal Engineer, Wikimedia Foundation

-- Marcel Ruiz Forns Analytics Developer Wikimedia Foundation

-- Marko Obrovac, PhD Senior Services Engineer Wikimedia Foundation

Erik Bernhardson

8:18 p.m.

One thing that I want to bring up that may not be captured in the endpoint described above (but maybe exists in another endpoint? I haven't been following): within search we would like to integrate page view statistics into our completion suggestions API. These indexes will be built once a week from a bulk process. Ideally we would like to be able to send over 100 or so titles and get average hourly page views for each page (random guess on exactly which metric, but something that indicates the relative popularity of the page) for the past week.


Dan Andreescu

11:07 p.m.

...

One thing that I want to bring up that may not be captured in the endpoint described above (but maybe exists in another endpoint? I haven't been following): within search we would like to integrate page view statistics into our completion suggestions API. These indexes will be built once a week from a bulk process. Ideally we would like to be able to send over 100 or so titles and get average hourly page views for each page (random guess on exactly which metric, but something that indicates the relative popularity of the page) for the past week.

This is possible using our "per-article" endpoint. There you can get hourly pageviews for a particular page title for an arbitrary time range. So grab a week of data for each of the 100 titles, then you can average or do whatever you like.
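The "average or do whatever you like" step could look roughly like the Python below. It assumes the hourly counts for each title have already been fetched into plain lists; the sample numbers are invented for illustration, and the function names are hypothetical rather than part of any endpoint.

```python
def average_hourly_views(hourly_counts):
    """Mean of the hourly counts; 0.0 when no data came back for the title."""
    if not hourly_counts:
        return 0.0
    return sum(hourly_counts) / len(hourly_counts)


def rank_by_popularity(views_by_title):
    """Return (title, average) pairs, most viewed first."""
    averages = {
        title: average_hourly_views(counts)
        for title, counts in views_by_title.items()
    }
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)


# Made-up sample data; a real week would have 168 hourly values per title.
sample = {"Foo": [10, 20, 30], "Bar": [100, 50], "Baz": []}
for title, average in rank_by_popularity(sample):
    print(title, average)
```

Only the relative order matters for ranking completion suggestions, so any monotonic popularity measure could be swapped in for the mean.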

Erik Bernhardson

11:33 p.m.

On Tue, Sep 15, 2015 at 10:07 AM, Dan Andreescu dandreescu@wikimedia.org wrote:

...

One thing that I want to bring up that may not be captured in the endpoint described above (but maybe exists in another endpoint? I haven't been following): within search we would like to integrate page view statistics into our completion suggestions API. These indexes will be built once a week from a bulk process. Ideally we would like to be able to send over 100 or so titles and get average hourly page views for each page (random guess on exactly which metric, but something that indicates the relative popularity of the page) for the past week.

This is possible using our "per-article" endpoint. There you can get hourly pageviews for a particular page title for an arbitrary time range. So grab a week of data for each of the 100 titles, then you can average or do whatever you like.

I worry a little bit about the performance without having a batch API, but we can certainly try it out and see what happens. Basically we will be requesting the page view information for every NS_MAIN article in every wiki once a week. A quick sum against our search cluster suggests this is ~96 million API requests.


Dan Andreescu

11:37 p.m.

...

I worry a little bit about the performance without having a batch api, but we can certainly try it out and see what happens. Basically we will be requesting the page view information for every NS_MAIN article in every wiki once a week. A quick sum against our search cluster suggests this is ~96 million api requests.

Oh, sorry, I thought you meant you were just querying 100 or so titles! In the case of huge queries like these, you should just query the wmf.pageview_hourly table directly. You can do so with plain SQL via Hive or maybe Impala if we end up setting that up. But those queries should be really fast in that table. We can help you write the query if you send us an attempt and a spec of exactly what you need.

Marko Obrovac

16 Sep

4:56 a.m.

On 15 September 2015 at 19:37, Dan Andreescu dandreescu@wikimedia.org wrote:

...

I worry a little bit about the performance without having a batch api, but we can certainly try it out and see what happens. Basically we will be requesting the page view information for every NS_MAIN article in every wiki once a week. A quick sum against our search cluster suggests this is ~96 million api requests.

96M requests spread over a week equals approx. 160 req/s, which is more than sustainable for RESTBase.

...

Oh, sorry, I thought you meant you were just querying 100 or so titles! In the case of huge queries like these, you should just query the wmf.pageview_hourly table directly. You can do so with plain SQL via Hive or maybe Impala if we end up setting that up. But those queries should be really fast in that table. We can help you write the query if you send us an attempt and a spec of exactly what you need.

My performance-oriented nature would also think about something like that, but I think this is not a decision that is to be taken lightly. While having an API doesn't come for free, its beauty lies in the abstraction. Concretely, as a pageview client, I am aware of the "contract" between the service and myself and as such, I trust it to fulfil its part of the job. How it does it is completely irrelevant to me, thus giving me the opportunity to focus on "my part of the job" (no need for me to worry about the internals of the implementation).

That said, it is clear as day that making 100 requests versus making one batch request takes more time. However, on the one hand, it sounds like Erik's use case is not latency- (or time-) sensitive. On the other, given the nature of the pageview API, the cost of computing the result (in case it is not available right away) dwarfs any connection or other related overheads.

Cheers, Marko

-- Marko Obrovac, PhD Senior Services Engineer Wikimedia Foundation

Dan Andreescu

8:16 a.m.

On Tue, Sep 15, 2015 at 6:56 PM, Marko Obrovac mobrovac@wikimedia.org wrote:

...

On 15 September 2015 at 19:37, Dan Andreescu dandreescu@wikimedia.org wrote:

...
I worry a little bit about the performance without having a batch api, but we can certainly try it out and see what happens. Basically we will be requesting the page view information for every NS_MAIN article in every wiki once a week. A quick sum against our search cluster suggests this is ~96 million api requests.

96m equals approx 160 req/s which is more than sustainable for RESTBase.

True, if we distributed the load over the whole week, but I think Erik needs the results to be available weekly, as in, probably within a day or so of issuing the request. Of course, if we were to serve this kind of request from the API, we would make a better batch-query endpoint for his use case. But I think it might be hard to make that useful generally. I think for now, let's just collect these one-off pageview querying use cases and slowly build them into the API when we can generalize two or more of them into one endpoint.
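The two load estimates in this exchange can be checked with a quick calculation. The sketch below only uses the ~96 million figure quoted above and compares spreading the requests over a full week with finishing them within a single day; the function name `required_rate` is just for illustration.

```python
TOTAL_REQUESTS = 96_000_000  # figure quoted in the thread
SECONDS_PER_DAY = 24 * 60 * 60


def required_rate(total_requests, window_days):
    """Average requests per second needed to finish within the window."""
    return total_requests / (window_days * SECONDS_PER_DAY)


print(f"over a week: {required_rate(TOTAL_REQUESTS, 7):.0f} req/s")
print(f"over a day:  {required_rate(TOTAL_REQUESTS, 1):.0f} req/s")
```

The one-day window needs seven times the sustained rate of the one-week window, which is the gap between the two positions above.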

Erik Bernhardson

9:15 a.m.

Makes sense. We will indeed be doing a batch process once a week to build the completion indices, which ideally will run through all the wikis in a day. We are going to do some analysis into how up to date our page view data really needs to be for scoring purposes, though. If we can get good scoring results while only updating page view info when a page is edited, we might be able to spread out the load across time that way and just hit the page view API once for each edit. Otherwise I'm sure we can do as suggested earlier: pull the data from Hive directly and stuff it into a temporary structure we can query while building the completion indices.


Dan Andreescu

7:21 p.m.

...

Otherwise I'm sure we can do as suggested earlier: pull the data from Hive directly and stuff it into a temporary structure we can query while building the completion indices.

Do you think that temporary structure might be useful to others? If so, we could add that as a data source, and add an endpoint to query it. Either way, happy to help with the query / temp structure.

Joseph Allemandou

8:03 p.m.

@Erik: Reading this thread makes me think that it might be interesting to have a chat about using Hadoop for indexing (https://github.com/elastic/elasticsearch-hadoop). I have no idea how you currently index, but I'd love to learn :) Please let me know if you think it could be useful!

Joseph


-- *Joseph Allemandou* Data Engineer @ Wikimedia Foundation IRC: joal

Toby Negrin

8:06 p.m.

Hadoop was originally built for indexing the web by processing the web map and exporting indexes to serving systems. I think integration with Elasticsearch would work well.

-Toby


Marko Obrovac

22 Sep

3:10 p.m.

Hello,

Just a small note which I don't think has been voiced thus far. There will actually be two APIs - one exposed by the Analytics' RESTBase instance, which will be accessible only from inside of WMF's infrastructure, and another, public-facing one (exposed by the Services' RESTBase instance).

Now, these may be identical (both in layout and functionality) or may (slightly) differ. Which way to go? The big pro of them being identical is that the client wouldn't need to care which RESTBase instance it is actually contacting. That would also ease API maintenance. On the downside, it increases the overhead for Analytics to keep their domain list in sync.

Having a more specialised API for the Analytics instance, on the other hand, would allow us to tailor it more for real internal use cases instead of focusing on the overall API coherence (which we need to do for the public-facing API). I'd honestly vote for that option.

On 16 September 2015 at 16:06, Toby Negrin tnegrin@wikimedia.org wrote:

...

Hadoop was originally built for indexing the web by processing the web map and exporting indexes to serving systems. I think integration with Elastic Search would work well.

Right, both are indexing systems (so to speak), but the former is for offline use, while the latter targets online use. Ideally, we should make them cooperate to get the best out of both worlds.

Cheers, Marko

...

-Toby

On Wed, Sep 16, 2015 at 7:03 AM, Joseph Allemandou jallemandou@wikimedia.org wrote:

...
@Erik: Reading this thread makes me think that it might be interesting to have a chat around using hadoop for indexing (https://github.com/elastic/elasticsearch-hadoop). I have no idea how you currently index, but I'd love to learn :) Please let me know if you think it could be useful! Joseph

On Wed, Sep 16, 2015 at 5:15 AM, Erik Bernhardson ebernhardson@wikimedia.org wrote:

...
makes sense. We will indeed be doing a batch process once a week to build the completion indices, which ideally will run through all the wikis in a day. We are going to do some analysis into how up to date our page view data really needs to be for scoring purposes, though. If we can get good scoring results while only updating page view info when a page is edited, we might be able to spread out the load across time that way and just hit the page view api once for each edit. Otherwise I'm sure we can do as suggested earlier and pull the data from hive directly and stuff it into a temporary structure we can query while building the completion indices.

On Tue, Sep 15, 2015 at 7:16 PM, Dan Andreescu dandreescu@wikimedia.org wrote:

...
On Tue, Sep 15, 2015 at 6:56 PM, Marko Obrovac mobrovac@wikimedia.org wrote:

...
On 15 September 2015 at 19:37, Dan Andreescu dandreescu@wikimedia.org wrote:

...
I worry a little bit about the performance without having a batch api, but we can certainly try it out and see what happens. Basically we will be requesting the page view information for every NS_MAIN article in every wiki once a week. A quick sum against our search cluster suggests this is ~96 million api requests.

96m equals approx 160 req/s which is more than sustainable for RESTBase.

True, if we distributed the load over the whole week, but I think Erik needs the results to be available weekly, as in, probably within a day or so of issuing the request. Of course, if we were to serve this kind of request from the API, we would make a better batch-query endpoint for his use case. But I think it might be hard to make that useful generally. I think for now, let's just collect these one-off pageview querying use cases and slowly build them into the API when we can generalize two or more of them into one endpoint.
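As a rough check of the two readings above, here is a minimal Python sketch of the arithmetic. It assumes the ~96 million figure quoted in the thread and a perfectly even request rate; the function and variable names are illustrative only and not part of any API.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def required_rate(total_requests: int, window_days: float) -> float:
    """Average requests per second needed to finish within the window."""
    return total_requests / (window_days * SECONDS_PER_DAY)


TOTAL = 96_000_000  # weekly request count quoted in the thread

week_rate = required_rate(TOTAL, 7)  # load spread evenly over the whole week
day_rate = required_rate(TOTAL, 1)   # same load compressed into a single day

print(f"spread over 7 days: {week_rate:.0f} req/s")
print(f"within 1 day:       {day_rate:.0f} req/s")
```

Spread over a week this comes out at roughly 160 req/s, matching the estimate above; squeezed into a day it is seven times higher, which is the concern raised in the reply.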

Analytics mailing list Analytics@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/analytics

-- *Joseph Allemandou* Data Engineer @ Wikimedia Foundation IRC: joal

-- Marko Obrovac, PhD Senior Services Engineer Wikimedia Foundation

Oliver Keyes

6:40 p.m.

On 22 September 2015 at 05:10, Marko Obrovac mobrovac@wikimedia.org wrote:

...

Hello,

Just a small note which I don't think has been voiced thus far. There will actually be two APIs - one exposed by the Analytics' RESTBase instance, which will be accessible only from inside of WMF's infrastructure, and another, public-facing one (exposed by the Services' RESTBase instance).

Now, these may be identical (both in layout and functionality) or may (slightly) differ. Which way to go? The big pro of them being identical is that the client wouldn't need to care which RESTBase instance it is actually contacting. That would also ease API maintenance. On the down side, that increases the overhead for Analytics to keep their domain list in sync.

Having a more specialised API for the Analytics instance, on the other hand, would allow us to tailor it more for real internal use cases instead of focusing on the overall API coherence (which we need to do for the public-facing API). I'd honestly vote for that option.

Can you give an example of internal-facing use cases you don't see a broader population of consumers being interested in?

...

On 16 September 2015 at 16:06, Toby Negrin tnegrin@wikimedia.org wrote:

...
Hadoop was originally built for indexing the web by processing the web map and exporting indexes to serving systems. I think integration with Elastic Search would work well.

Right, both are indexing systems (so to speak), but the former is for offline use, while the latter targets online use. Ideally, we should make them cooperate to get the best out of both worlds.

Cheers, Marko

...
-Toby

On Wed, Sep 16, 2015 at 7:03 AM, Joseph Allemandou jallemandou@wikimedia.org wrote:

...
@Erik: Reading this thread makes me think that it might be interesting to have a chat around using hadoop for indexing (https://github.com/elastic/elasticsearch-hadoop). I have no idea how you currently index, but I'd love to learn :) Please let me know if you think it could be useful ! Joseph

On Wed, Sep 16, 2015 at 5:15 AM, Erik Bernhardson ebernhardson@wikimedia.org wrote:

...
makes sense. We will indeed be doing a batch process once a week to build the completion indices, which ideally will run through all the wikis in a day. We are going to do some analysis into how up to date our page view data really needs to be for scoring purposes, though. If we can get good scoring results while only updating page view info when a page is edited, we might be able to spread out the load across time that way and just hit the page view api once for each edit. Otherwise I'm sure we can do as suggested earlier and pull the data from hive directly and stuff it into a temporary structure we can query while building the completion indices.

On Tue, Sep 15, 2015 at 7:16 PM, Dan Andreescu dandreescu@wikimedia.org wrote:

...
On Tue, Sep 15, 2015 at 6:56 PM, Marko Obrovac mobrovac@wikimedia.org wrote:

...
On 15 September 2015 at 19:37, Dan Andreescu dandreescu@wikimedia.org wrote:

...
I worry a little bit about the performance without having a batch api, but we can certainly try it out and see what happens. Basically we will be requesting the page view information for every NS_MAIN article in every wiki once a week. A quick sum against our search cluster suggests this is ~96 million api requests.

96m equals approx 160 req/s which is more than sustainable for RESTBase.

True, if we distributed the load over the whole week, but I think Erik needs the results to be available weekly, as in, probably within a day or so of issuing the request. Of course, if we were to serve this kind of request from the API, we would make a better batch-query endpoint for his use case. But I think it might be hard to make that useful generally. I think for now, let's just collect these one-off pageview querying use cases and slowly build them into the API when we can generalize two or more of them into one endpoint.

-- Joseph Allemandou Data Engineer @ Wikimedia Foundation IRC: joal

-- Marko Obrovac, PhD Senior Services Engineer Wikimedia Foundation

-- Oliver Keyes Count Logula Wikimedia Foundation

Marko Obrovac

10:53 p.m.

On 22 September 2015 at 14:40, Oliver Keyes okeyes@wikimedia.org wrote:

...

On 22 September 2015 at 05:10, Marko Obrovac mobrovac@wikimedia.org wrote:

...
Hello,

Just a small note which I don't think has been voiced thus far. There will actually be two APIs - one exposed by the Analytics' RESTBase instance, which will be accessible only from inside of WMF's infrastructure, and another, public-facing one (exposed by the Services' RESTBase instance).

Now, these may be identical (both in layout and functionality) or may (slightly) differ. Which way to go? The big pro of them being identical is that the client wouldn't need to care which RESTBase instance it is actually contacting. That would also ease API maintenance. On the down side, that increases the overhead for Analytics to keep their domain list in sync.

Having a more specialised API for the Analytics instance, on the other hand, would allow us to tailor it more for real internal use cases instead of focusing on the overall API coherence (which we need to do for the public-facing API). I'd honestly vote for that option.

Can you give an example of internal-facing use cases you don't see a broader population of consumers being interested in?

In my mail I was mostly hinting at the fact that the public-facing API is divided by domains, whilst the notion of projects is better suited for Analytics. So the internal API could be organised around projects while still supporting domains, but in a looser format than the public one.

We plan to support arbitrary projects (such as en-all, all-wiktionary, etc) on the public side as well, but because of the current layout, the analytics' (public) API will be fragmented. There is no need to do such a thing with the internal API too.

To concretely answer the question, I am not aware of any specific use case. Just pointing out that internal users can, if they need/want, rely on projects rather than on domains.

Cheers, Marko

...

On 16 September 2015 at 16:06, Toby Negrin tnegrin@wikimedia.org wrote:

...
Hadoop was originally built for indexing the web by processing the web map and exporting indexes to serving systems. I think integration with Elastic Search would work well.

Right, both are indexing systems (so to speak), but the former is for offline use, while the latter targets online use. Ideally, we should make them cooperate to get the best out of both worlds.

Cheers, Marko

...
-Toby

On Wed, Sep 16, 2015 at 7:03 AM, Joseph Allemandou jallemandou@wikimedia.org wrote:

...
@Erik: Reading this thread makes me think that it might be interesting to have a chat around using hadoop for indexing (https://github.com/elastic/elasticsearch-hadoop). I have no idea how you currently index, but I'd love to learn :) Please let me know if you think it could be useful! Joseph

On Wed, Sep 16, 2015 at 5:15 AM, Erik Bernhardson ebernhardson@wikimedia.org wrote:

...
makes sense. We will indeed be doing a batch process once a week to build the completion indices, which ideally will run through all the wikis in a day. We are going to do some analysis into how up to date our page view data really needs to be for scoring purposes, though. If we can get good scoring results while only updating page view info when a page is edited, we might be able to spread out the load across time that way and just hit the page view api once for each edit. Otherwise I'm sure we can do as suggested earlier and pull the data from hive directly and stuff it into a temporary structure we can query while building the completion indices.

On Tue, Sep 15, 2015 at 7:16 PM, Dan Andreescu dandreescu@wikimedia.org wrote:

...
On Tue, Sep 15, 2015 at 6:56 PM, Marko Obrovac mobrovac@wikimedia.org wrote:

...
On 15 September 2015 at 19:37, Dan Andreescu dandreescu@wikimedia.org wrote:

...
I worry a little bit about the performance without having a batch api, but we can certainly try it out and see what happens. Basically we will be requesting the page view information for every NS_MAIN article in every wiki once a week. A quick sum against our search cluster suggests this is ~96 million api requests.

96m equals approx 160 req/s which is more than sustainable for RESTBase.

True, if we distributed the load over the whole week, but I think Erik needs the results to be available weekly, as in, probably within a day or so of issuing the request. Of course, if we were to serve this kind of request from the API, we would make a better batch-query endpoint for his use case. But I think it might be hard to make that useful generally. I think for now, let's just collect these one-off pageview querying use cases and slowly build them into the API when we can generalize two or more of them into one endpoint.

-- Joseph Allemandou Data Engineer @ Wikimedia Foundation IRC: joal

-- Marko Obrovac, PhD Senior Services Engineer Wikimedia Foundation

-- Oliver Keyes Count Logula Wikimedia Foundation

-- Marko Obrovac, PhD Senior Services Engineer Wikimedia Foundation

Dan Andreescu

15 Sep

11:19 p.m.

Thanks for your thoughts, Marko.

...

As for the specific endpoints, perhaps it'd be worth switching the places of *top* and the project name to be more in line with the current public RESTful URI layout?

We can talk about this as part of the phab tasks you linked to. If we keep the project as one of the parameters, then it makes sense for it to be further down in the URI, where the other parameters are. "top" in that case would just be the type of query, not a parameter. If we use the domain in place of the project, which we can only do in non-aggregate cases, then that would be in line with the other public RESTful URIs.

...

Also, I must admit I find the non-determinism of the endpoints confusing to some extent. Specifically I'm referring to the `/{start}/{end}` portion (or, in your notation, this should really be `/{start}{/end}` denoting that `{end}` is an optional URI parameter), the problem being exactly that `{end}` is optional and, if not supplied, the current date is assumed. That entails that the result of making a request to the endpoint without an end date (or timestamp) depends on the context (the context in this case being the time stamp of the request). So, one day the request encompasses a span of 24h, while the next day that same request refers to a 48h period.

[... snip good points ...]

Hm, I think I'm leaning towards Marko and Andrew's point of view. Relative-valued parameters seem to make caching confusing to think about too. Like start=-30 and end=-1 would have to be evicted precisely and surely at midnight, but which timezone? :)

Ok, so we'll go with absolute-valued deterministic parameters (our current code has only these types of parameters). Maybe if Druid serves as our data store, we can think about this again and see if we can provide a human-friendly interface too.
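To make the caching concern concrete, here is a small Python sketch showing how the same relative request (start=-30, end=-1) resolves to different absolute ranges depending on which time zone's "today" is used. The `resolve` helper and the chosen instant and UTC offset are assumptions for illustration, not part of the proposed API.

```python
from datetime import datetime, timedelta, timezone


def resolve(offset_days: int, now: datetime) -> str:
    """Turn a negative day offset into an absolute ISO date relative to `now`."""
    return (now.date() + timedelta(days=offset_days)).isoformat()


# One instant in time, viewed from two time zones (illustrative values).
instant = datetime(2015, 9, 16, 1, 30, tzinfo=timezone.utc)
utc_now = instant
pacific_now = instant.astimezone(timezone(timedelta(hours=-7)))

utc_range = (resolve(-30, utc_now), resolve(-1, utc_now))
pacific_range = (resolve(-30, pacific_now), resolve(-1, pacific_now))

print("UTC:    ", utc_range)
print("Pacific:", pacific_range)
```

The two ranges differ by a day even though the request and the instant are identical, so a cache keyed on the relative URL would need an agreed "midnight" to evict on. Absolute dates avoid the question entirely.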

Unless, of course, you target mostly human requests, in which case my argument is rendered moot :P

I think it's pretty mixed actually, we have asks from humans, bot-writers, analysts, teams at WMF, etc.

paul@paulweiss.info

12 Sep

4:19 a.m.

I concur with Leila.

Paul

--------- Original Message --------- Subject: Re: [Analytics] [Survey] Pageview API From: "Leila Zia" leila@wikimedia.org Date: 9/11/15 3:06 pm To: "A mailing list for the Analytics Team at WMF and everybody who has an interest in Wikipedia and analytics." analytics@lists.wikimedia.org

It's getting exciting. :-)

I'd go with choice 2 since it gives more control to the user while offering what the user can get through choice 1 as well.

Question: will we get page_ids or page_titles or both? It's good to have both.

Leila

On Fri, Sep 11, 2015 at 3:00 PM, Dan Andreescu dandreescu@wikimedia.org wrote:

Hi everyone. End of quarter is rapidly approaching and I wanted to ask a quick question about one of the endpoints we want to push out. We want to let you ask "what are the top articles" but we're not sure how to structure the URL so it's most useful to you. Here are the choices:

Choice 1. /top/{project}/{access}/{days-in-the-past}

Example: top articles via all en.wikipedia sites for the past 30 days: /top/en.wikipedia/all-access/30

Choice 2. /top/{project}/{access}/{start}/{end}

Example: top articles via all en.wikipedia sites from June 12th, 2014 to August 30th, 2015: /top/en.wikipedia/all-access/2014-06-12/2015-08-30

(in all of those,

* {project} means en.wikipedia, commons.wikimedia, etc.
* {access} means access method as in desktop, mobile web, mobile app

)

Which do you prefer? Would any other query style be useful?

Gabriel Wicke

5:26 a.m.

Another option would be a single entry point

/top/{project}/{access}/from/{start}{/end}

with support for negative indexes for 'days in the past':

/top/{project}/{access}/from/-30
/top/{project}/{access}/from/-60/-30

as well as full dates:

/top/en.wikipedia/all-access/2014-06-12/2015-08-30

On Fri, Sep 11, 2015 at 3:19 PM, paul@paulweiss.info wrote:

...

I concur with Leila.

Paul

--------- Original Message --------- Subject: Re: [Analytics] [Survey] Pageview API From: "Leila Zia" leila@wikimedia.org Date: 9/11/15 3:06 pm To: "A mailing list for the Analytics Team at WMF and everybody who has an interest in Wikipedia and analytics." analytics@lists.wikimedia.org

It's getting exciting. :-)

I'd go with choice 2 since it gives more control to the user while offering what the user can get through choice 1 as well.

Question: will we get page_ids or page_titles or both? It's good to have both.

Leila

On Fri, Sep 11, 2015 at 3:00 PM, Dan Andreescu dandreescu@wikimedia.org wrote:

...
Hi everyone. End of quarter is rapidly approaching and I wanted to ask a quick question about one of the endpoints we want to push out. We want to let you ask "what are the top articles" but we're not sure how to structure the URL so it's most useful to you. Here are the choices:

Choice 1. /top/{project}/{access}/{days-in-the-past}

Example: top articles via all en.wikipedia sites for the past 30 days: /top/en.wikipedia/all-access/30

Choice 2. /top/{project}/{access}/{start}/{end}

Example: top articles via all en.wikipedia sites from June 12th, 2014 to August 30th, 2015: /top/en.wikipedia/all-access/2014-06-12/2015-08-30

(in all of those,

{project} means en.wikipedia, commons.wikimedia, etc.

{access} means access method as in desktop, mobile web, mobile app

)

Which do you prefer? Would any other query style be useful?

-- Gabriel Wicke Principal Engineer, Wikimedia Foundation

Gabriel Wicke

5:27 a.m.

On Fri, Sep 11, 2015 at 4:26 PM, Gabriel Wicke gwicke@wikimedia.org wrote:

...

Another option would be a single entry point

/top/{project}/{access}/from/{start}{/end}

with support for negative indexes for 'days in the past':

/top/{project}/{access}/from/-30
/top/{project}/{access}/from/-60/-30

as well as full dates:

/top/en.wikipedia/all-access/2014-06-12/2015-08-30

Correction:

/top/en.wikipedia/all-access/from/2014-06-12/2015-08-30
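A minimal Python sketch of how a single `/from/{start}{/end}` entry point might resolve both styles of parameter into absolute dates. It assumes negative integers mean "days before today" and that a missing `{end}` defaults to today; the helper names `parse_bound` and `resolve_range` are hypothetical, not part of RESTBase or the eventual API.

```python
from datetime import date, timedelta
from typing import Optional, Tuple


def parse_bound(value: str, today: date) -> date:
    """Accept a negative day offset ('-30') or a full ISO date ('2014-06-12')."""
    if value.startswith("-") and value[1:].isdigit():
        return today - timedelta(days=int(value[1:]))
    return date.fromisoformat(value)


def resolve_range(start: str, end: Optional[str], today: date) -> Tuple[date, date]:
    """Resolve {start} and optional {end} into an absolute (start, end) pair."""
    start_date = parse_bound(start, today)
    # Assumption: an omitted {end} means "up to today".
    end_date = parse_bound(end, today) if end is not None else today
    if start_date > end_date:
        raise ValueError("start must not be after end")
    return start_date, end_date


today = date(2015, 9, 11)  # fixed "today" so the example is reproducible
print(resolve_range("-30", None, today))
print(resolve_range("-60", "-30", today))
print(resolve_range("2014-06-12", "2015-08-30", today))
```

Passing `today` in explicitly keeps the function deterministic and testable; only the caller decides which clock and time zone define "today".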

...

On Fri, Sep 11, 2015 at 3:19 PM, paul@paulweiss.info wrote:

...
I concur with Leila.

Paul

--------- Original Message --------- Subject: Re: [Analytics] [Survey] Pageview API From: "Leila Zia" leila@wikimedia.org Date: 9/11/15 3:06 pm To: "A mailing list for the Analytics Team at WMF and everybody who has an interest in Wikipedia and analytics." analytics@lists.wikimedia.org

It's getting exciting. :-)

I'd go with choice 2 since it gives more control to the user while offering what the user can get through choice 1 as well.

Question: will we get page_ids or page_titles or both? It's good to have both.

Leila

On Fri, Sep 11, 2015 at 3:00 PM, Dan Andreescu dandreescu@wikimedia.org wrote:

...
Hi everyone. End of quarter is rapidly approaching and I wanted to ask a quick question about one of the endpoints we want to push out. We want to let you ask "what are the top articles" but we're not sure how to structure the URL so it's most useful to you. Here are the choices:

Choice 1. /top/{project}/{access}/{days-in-the-past}

Example: top articles via all en.wikipedia sites for the past 30 days: /top/en.wikipedia/all-access/30

Choice 2. /top/{project}/{access}/{start}/{end}

Example: top articles via all en.wikipedia sites from June 12th, 2014 to August 30th, 2015: /top/en.wikipedia/all-access/2014-06-12/2015-08-30

(in all of those,

{project} means en.wikipedia, commons.wikimedia, etc.

{access} means access method as in desktop, mobile web, mobile app

)

Which do you prefer? Would any other query style be useful?

-- Gabriel Wicke Principal Engineer, Wikimedia Foundation

-- Gabriel Wicke Principal Engineer, Wikimedia Foundation

Oliver Keyes

5:44 a.m.

Big +1 to Adam. Is the top articles the first deliverable we should expect?

On 11 September 2015 at 19:27, Gabriel Wicke gwicke@wikimedia.org wrote:

...

On Fri, Sep 11, 2015 at 4:26 PM, Gabriel Wicke gwicke@wikimedia.org wrote:

...
Another option would be a single entry point

/top/{project}/{access}/from/{start}{/end}

with support for negative indexes for 'days in the past':

/top/{project}/{access}/from/-30
/top/{project}/{access}/from/-60/-30

as well as full dates:

/top/en.wikipedia/all-access/2014-06-12/2015-08-30

Correction:

/top/en.wikipedia/all-access/from/2014-06-12/2015-08-30

...
On Fri, Sep 11, 2015 at 3:19 PM, paul@paulweiss.info wrote:

...
I concur with Leila.

Paul

--------- Original Message --------- Subject: Re: [Analytics] [Survey] Pageview API From: "Leila Zia" leila@wikimedia.org Date: 9/11/15 3:06 pm To: "A mailing list for the Analytics Team at WMF and everybody who has an interest in Wikipedia and analytics." analytics@lists.wikimedia.org

It's getting exciting. :-)

I'd go with choice 2 since it gives more control to the user while offering what the user can get through choice 1 as well.

Question: will we get page_ids or page_titles or both? It's good to have both.

Leila

On Fri, Sep 11, 2015 at 3:00 PM, Dan Andreescu dandreescu@wikimedia.org wrote:

...
Hi everyone. End of quarter is rapidly approaching and I wanted to ask a quick question about one of the endpoints we want to push out. We want to let you ask "what are the top articles" but we're not sure how to structure the URL so it's most useful to you. Here are the choices:

Choice 1. /top/{project}/{access}/{days-in-the-past}

Example: top articles via all en.wikipedia sites for the past 30 days: /top/en.wikipedia/all-access/30

Choice 2. /top/{project}/{access}/{start}/{end}

Example: top articles via all en.wikipedia sites from June 12th, 2014 to August 30th, 2015: /top/en.wikipedia/all-access/2014-06-12/2015-08-30

(in all of those,

{project} means en.wikipedia, commons.wikimedia, etc.

{access} means access method as in desktop, mobile web, mobile app

)

Which do you prefer? Would any other query style be useful?

-- Gabriel Wicke Principal Engineer, Wikimedia Foundation

-- Gabriel Wicke Principal Engineer, Wikimedia Foundation

-- Oliver Keyes Count Logula Wikimedia Foundation

Jonathan Morgan

6:08 a.m.

I would prefer to have both days-in-past and start/end daterange options, along the lines of Adam's proposal.

But if I have to choose one, I concur with Leila. Start/end daterange offers more functionality.

Jonathan

On Fri, Sep 11, 2015 at 4:44 PM, Oliver Keyes okeyes@wikimedia.org wrote:

...

Big +1 to Adam. Is the top articles the first deliverable we should expect?

On 11 September 2015 at 19:27, Gabriel Wicke gwicke@wikimedia.org wrote:

...
On Fri, Sep 11, 2015 at 4:26 PM, Gabriel Wicke gwicke@wikimedia.org wrote:

...
Another option would be a single entry point

/top/{project}/{access}/from/{start}{/end}

with support for negative indexes for 'days in the past':

/top/{project}/{access}/from/-30
/top/{project}/{access}/from/-60/-30

as well as full dates:

/top/en.wikipedia/all-access/2014-06-12/2015-08-30

Correction:

/top/en.wikipedia/all-access/from/2014-06-12/2015-08-30

...
On Fri, Sep 11, 2015 at 3:19 PM, paul@paulweiss.info wrote:

...
I concur with Leila.

Paul

--------- Original Message --------- Subject: Re: [Analytics] [Survey] Pageview API From: "Leila Zia" leila@wikimedia.org Date: 9/11/15 3:06 pm To: "A mailing list for the Analytics Team at WMF and everybody who has an interest in Wikipedia and analytics." analytics@lists.wikimedia.org

It's getting exciting. :-)

I'd go with choice 2 since it gives more control to the user while offering what the user can get through choice 1 as well.

Question: will we get page_ids or page_titles or both? It's good to have both.

Leila

On Fri, Sep 11, 2015 at 3:00 PM, Dan Andreescu dandreescu@wikimedia.org wrote:

...
Hi everyone. End of quarter is rapidly approaching and I wanted to ask a quick question about one of the endpoints we want to push out. We want to let you ask "what are the top articles" but we're not sure how to structure the URL so it's most useful to you. Here are the choices:

Choice 1. /top/{project}/{access}/{days-in-the-past}

Example: top articles via all en.wikipedia sites for the past 30 days: /top/en.wikipedia/all-access/30

Choice 2. /top/{project}/{access}/{start}/{end}

Example: top articles via all en.wikipedia sites from June 12th, 2014 to August 30th, 2015: /top/en.wikipedia/all-access/2014-06-12/2015-08-30

(in all of those,

{project} means en.wikipedia, commons.wikimedia, etc.

{access} means access method as in desktop, mobile web, mobile app

)

Which do you prefer? Would any other query style be useful?

-- Gabriel Wicke Principal Engineer, Wikimedia Foundation

-- Gabriel Wicke Principal Engineer, Wikimedia Foundation

-- Oliver Keyes Count Logula Wikimedia Foundation

-- Jonathan T. Morgan Senior Design Researcher Wikimedia Foundation User:Jmorgan (WMF) https://meta.wikimedia.org/wiki/User:Jmorgan_(WMF)

Sage Ross

6:11 a.m.

And, per Tgr on Phabricator, option 2 means that urls are stable, so I can link to some data and expect it to show the same data later on.

-Sage

On Fri, Sep 11, 2015 at 5:08 PM, Jonathan Morgan jmorgan@wikimedia.org wrote:

...

I would prefer to have both days-in-past and start/end daterange options, along the lines of Adam's proposal.

But if I have to choose one, I concur with Leila. Start/end daterange offers more functionality.

Jonathan

On Fri, Sep 11, 2015 at 4:44 PM, Oliver Keyes okeyes@wikimedia.org wrote:

...
Big +1 to Adam. Is the top articles the first deliverable we should expect?

On 11 September 2015 at 19:27, Gabriel Wicke gwicke@wikimedia.org wrote:

...
On Fri, Sep 11, 2015 at 4:26 PM, Gabriel Wicke gwicke@wikimedia.org wrote:

...
Another option would be a single entry point

/top/{project}/{access}/from/{start}{/end}

with support for negative indexes for 'days in the past':

/top/{project}/{access}/from/-30
/top/{project}/{access}/from/-60/-30

as well as full dates:

/top/en.wikipedia/all-access/2014-06-12/2015-08-30

Correction:

/top/en.wikipedia/all-access/from/2014-06-12/2015-08-30

...
On Fri, Sep 11, 2015 at 3:19 PM, paul@paulweiss.info wrote:

...
I concur with Leila.

Paul

--------- Original Message --------- Subject: Re: [Analytics] [Survey] Pageview API From: "Leila Zia" leila@wikimedia.org Date: 9/11/15 3:06 pm To: "A mailing list for the Analytics Team at WMF and everybody who has an interest in Wikipedia and analytics." analytics@lists.wikimedia.org

It's getting exciting. :-)

I'd go with choice 2 since it gives more control to the user while offering what the user can get through choice 1 as well.

Question: will we get page_ids or page_titles or both? It's good to have both.

Leila

On Fri, Sep 11, 2015 at 3:00 PM, Dan Andreescu dandreescu@wikimedia.org wrote:

...
Hi everyone. End of quarter is rapidly approaching and I wanted to ask a quick question about one of the endpoints we want to push out. We want to let you ask "what are the top articles" but we're not sure how to structure the URL so it's most useful to you. Here are the choices:

Choice 1. /top/{project}/{access}/{days-in-the-past}

Example: top articles via all en.wikipedia sites for the past 30 days: /top/en.wikipedia/all-access/30

Choice 2. /top/{project}/{access}/{start}/{end}

Example: top articles via all en.wikipedia sites from June 12th, 2014 to August 30th, 2015: /top/en.wikipedia/all-access/2014-06-12/2015-08-30

(in all of those,

{project} means en.wikipedia, commons.wikimedia, etc.

{access} means access method as in desktop, mobile web, mobile app

)

Which do you prefer? Would any other query style be useful?

-- Gabriel Wicke Principal Engineer, Wikimedia Foundation

-- Gabriel Wicke Principal Engineer, Wikimedia Foundation

-- Oliver Keyes Count Logula Wikimedia Foundation

-- Jonathan T. Morgan Senior Design Researcher Wikimedia Foundation User:Jmorgan (WMF)

Timo Tijhof

13 Sep

10:12 a.m.

I'd also recommend doing both. +1 to the schema proposed by Gabriel.

I'm sure it'll come up eventually but, just a few thoughts:

* Both can be cached (one can be cached for <24h, the other longer).
* The dynamic range allows useful linking to answer canonical questions regarding current trends.
* The dynamic range can either be a redirect resolved by the web app, or it can simply do the response directly (I recommend the latter; but either can cache for <24h). It should probably specify "link rel=canonical" (HTTP header or HTML tag, for machines) with the expanded url, and in case of a UI it can advertise this as the "Permalink" (for humans).
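The caching and canonical-link points above could look roughly like the following Python sketch. The `cache_headers` function and the max-age values are assumptions chosen for illustration (one hour for a dynamic range, thirty days for an absolute one); only the `Cache-Control` and `Link` header names are standard HTTP.

```python
from typing import Dict


def cache_headers(requested_path: str, canonical_path: str) -> Dict[str, str]:
    """Build response headers for a /top request.

    `canonical_path` is the same request with every relative offset
    expanded into absolute dates (the "Permalink").
    """
    is_dynamic = requested_path != canonical_path
    # Illustrative lifetimes: dynamic ranges must expire in under 24h,
    # absolute ranges can be kept much longer.
    max_age = 3600 if is_dynamic else 30 * 24 * 3600
    return {
        "Cache-Control": f"public, max-age={max_age}",
        "Link": f'<{canonical_path}>; rel="canonical"',
    }


dynamic = cache_headers(
    "/top/en.wikipedia/all-access/from/-30",
    "/top/en.wikipedia/all-access/from/2015-08-12/2015-09-11",
)
absolute = cache_headers(
    "/top/en.wikipedia/all-access/from/2014-06-12/2015-08-30",
    "/top/en.wikipedia/all-access/from/2014-06-12/2015-08-30",
)
print(dynamic)
print(absolute)
```

Serving the dynamic response directly, rather than redirecting, keeps it to one round trip while the `Link` header still tells machines where the stable URL is.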

-- Timo

On Sat, Sep 12, 2015 at 1:11 AM, Sage Ross ragesoss+wikipedia@gmail.com wrote:

...

And, per Tgr on Phabricator, option 2 means that urls are stable, so I can link to some data and expect it to show the same data later on.

-Sage

On Fri, Sep 11, 2015 at 5:08 PM, Jonathan Morgan jmorgan@wikimedia.org wrote:

...
I would prefer to have both days-in-past and start/end daterange options, along the lines of Adam's proposal.

But if I have to choose one, I concur with Leila. Start/end daterange offers more functionality.

Jonathan

On Fri, Sep 11, 2015 at 4:44 PM, Oliver Keyes okeyes@wikimedia.org wrote:

...
...
Big +1 to Adam. Is the top articles the first deliverable we should expect?

On 11 September 2015 at 19:27, Gabriel Wicke gwicke@wikimedia.org wrote:

...
...
...
On Fri, Sep 11, 2015 at 4:26 PM, Gabriel Wicke gwicke@wikimedia.org wrote:

...
Another option would be a single entry point

/top/{project}/{access}/from/{start}{/end}

with support for negative indexes for 'days in the past':

/top/{project}/{access}/from/-30

/top/{project}/{access}/from/-60/-30

as well as full dates:

/top/en.wikipedia/all-access/2014-06-12/2015-08-30

Correction:

/top/en.wikipedia/all-access/from/2014-06-12/2015-08-30

...
On Fri, Sep 11, 2015 at 3:19 PM, paul@paulweiss.info wrote:

...
I concur with Leila.

Paul

--------- Original Message --------- Subject: Re: [Analytics] [Survey] Pageview API From: "Leila Zia" leila@wikimedia.org Date: 9/11/15 3:06 pm To: "A mailing list for the Analytics Team at WMF and everybody who has an interest in Wikipedia and analytics." analytics@lists.wikimedia.org

It's getting exciting. :-)

I'd go with choice 2 since it gives more control to the user while offering what the user can get through choice 1 as well.

Question: will we get page_ids or page_titles or both? It's good to have both.

Leila

On Fri, Sep 11, 2015 at 3:00 PM, Dan Andreescu dandreescu@wikimedia.org wrote:

Hi everyone. End of quarter is rapidly approaching and I wanted to ask a quick question about one of the endpoints we want to push out. We want to let you ask "what are the top articles" but we're not sure how to structure the URL so it's most useful to you. Here are the choices:

Choice 1. /top/{project}/{access}/{days-in-the-past}

Example: top articles via all en.wikipedia sites for the past 30 days: /top/en.wikipedia/all-access/30

Choice 2. /top/{project}/{access}/{start}/{end}

Example: top articles via all en.wikipedia sites from June 12th, 2014 to August 30th, 2015: /top/en.wikipedia/all-access/2014-06-12/2015-08-30

(in all of those,

{project} means en.wikipedia, commons.wikimedia, etc.

{access} means access method as in desktop, mobile web, mobile app

)

Which do you prefer? Would any other query style be useful?


Toby Negrin

12 Sep

5:46 a.m.

This seems like a weird way to use restful URLs. Why not parameters?

-Toby

On Fri, Sep 11, 2015 at 4:27 PM, Gabriel Wicke gwicke@wikimedia.org wrote:

...

On Fri, Sep 11, 2015 at 4:26 PM, Gabriel Wicke gwicke@wikimedia.org wrote:

...
Another option would be a single entry point

/top/{project}/{access}/from/{start}{/end}

with support for negative indexes for 'days in the past':

/top/{project}/{access}/from/-30

/top/{project}/{access}/from/-60/-30

as well as full dates:

/top/en.wikipedia/all-access/2014-06-12/2015-08-30

Correction:

/top/en.wikipedia/all-access/from/2014-06-12/2015-08-30

...
On Fri, Sep 11, 2015 at 3:19 PM, paul@paulweiss.info wrote:

...
I concur with Leila.

Paul

--------- Original Message --------- Subject: Re: [Analytics] [Survey] Pageview API From: "Leila Zia" leila@wikimedia.org Date: 9/11/15 3:06 pm To: "A mailing list for the Analytics Team at WMF and everybody who has an interest in Wikipedia and analytics." analytics@lists.wikimedia.org

It's getting exciting. :-)

I'd go with choice 2 since it gives more control to the user while offering what the user can get through choice 1 as well.

Question: will we get page_ids or page_titles or both? It's good to have both.

Leila

On Fri, Sep 11, 2015 at 3:00 PM, Dan Andreescu dandreescu@wikimedia.org wrote:

...
Hi everyone. End of quarter is rapidly approaching and I wanted to ask a quick question about one of the endpoints we want to push out. We want to let you ask "what are the top articles" but we're not sure how to structure the URL so it's most useful to you. Here are the choices:

Choice 1. /top/{project}/{access}/{days-in-the-past}

Example: top articles via all en.wikipedia sites for the past 30 days: /top/en.wikipedia/all-access/30

Choice 2. /top/{project}/{access}/{start}/{end}

Example: top articles via all en.wikipedia sites from June 12th, 2014 to August 30th, 2015: /top/en.wikipedia/all-access/2014-06-12/2015-08-30

(in all of those,

{project} means en.wikipedia, commons.wikimedia, etc.

{access} means access method as in desktop, mobile web, mobile app

)

Which do you prefer? Would any other query style be useful?


Gabriel Wicke

6:09 a.m.

Toby, main reason for REST paths over query strings is typically caching. With query strings and multiple parameters, the order and presence of parameters is not deterministic. You can use ?from=something&to=somethingElse or ?to=somethingElse&from=something, which both would be separate cache entries, which is an issue if you plan to cache for longer times & purge actively.

In this particular case it should actually be fine to rely on short time caching only, which means that query parameters are an option as well.
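The caching point can be illustrated with a small sketch. Two query strings that differ only in parameter order are different strings, so a cache keyed on the raw URL stores them separately unless the key is normalised first. The `cache_key` helper below is a hypothetical illustration of such normalisation, not something the caching layer discussed here is known to do.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit


def cache_key(url):
    """Return a normalised key: the path plus query parameters sorted by name."""
    parts = urlsplit(url)
    # Sorting makes ?from=a&to=b and ?to=b&from=a collapse to one key.
    params = sorted(parse_qsl(parts.query, keep_blank_values=True))
    query = urlencode(params)
    return parts.path + ("?" + query if query else "")


if __name__ == "__main__":
    a = "/top?from=something&to=somethingElse"
    b = "/top?to=somethingElse&from=something"
    print(a == b)                        # raw URLs differ
    print(cache_key(a) == cache_key(b))  # normalised keys match
```

With path-based URLs the segment order is fixed by the route, so no such normalisation step is needed and active purging only has to target one URL.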

On Fri, Sep 11, 2015 at 4:46 PM, Toby Negrin tnegrin@wikimedia.org wrote:

...

This seems like a weird way to use restful URLs. Why not parameters?

-Toby

On Fri, Sep 11, 2015 at 4:27 PM, Gabriel Wicke gwicke@wikimedia.org wrote:

...
On Fri, Sep 11, 2015 at 4:26 PM, Gabriel Wicke gwicke@wikimedia.org wrote:

...
Another option would be a single entry point

/top/{project}/{access}/from/{start}{/end}

with support for negative indexes for 'days in the past':

/top/{project}/{access}/from/-30

/top/{project}/{access}/from/-60/-30

as well as full dates:

/top/en.wikipedia/all-access/2014-06-12/2015-08-30

Correction:

/top/en.wikipedia/all-access/from/2014-06-12/2015-08-30

...
On Fri, Sep 11, 2015 at 3:19 PM, paul@paulweiss.info wrote:

...
I concur with Leila.

Paul

--------- Original Message --------- Subject: Re: [Analytics] [Survey] Pageview API From: "Leila Zia" leila@wikimedia.org Date: 9/11/15 3:06 pm To: "A mailing list for the Analytics Team at WMF and everybody who has an interest in Wikipedia and analytics." analytics@lists.wikimedia.org
It's getting exciting. :-)

I'd go with choice 2 since it gives more control to the user while offering what the user can get through choice 1 as well.

Question: will we get page_ids or page_titles or both? It's good to have both.

Leila

On Fri, Sep 11, 2015 at 3:00 PM, Dan Andreescu dandreescu@wikimedia.org wrote:

...
Hi everyone. End of quarter is rapidly approaching and I wanted to ask a quick question about one of the endpoints we want to push out. We want to let you ask "what are the top articles" but we're not sure how to structure the URL so it's most useful to you. Here are the choices:

Choice 1. /top/{project}/{access}/{days-in-the-past}

Example: top articles via all en.wikipedia sites for the past 30 days: /top/en.wikipedia/all-access/30

Choice 2. /top/{project}/{access}/{start}/{end}

Example: top articles via all en.wikipedia sites from June 12th, 2014 to August 30th, 2015: /top/en.wikipedia/all-access/2014-06-12/2015-08-30

(in all of those,

{project} means en.wikipedia, commons.wikimedia, etc.

{access} means access method as in desktop, mobile web, mobile app

)

Which do you prefer? Would any other query style be useful?


-- Gabriel Wicke Principal Engineer, Wikimedia Foundation

Thomas Steiner

13 Sep

2:37 p.m.

Hi all,

Two additional questions: (i) Are there plans for making this data available via this API at lower granularity (hourly, or even more fine grained, or even in streaming realtime form)? (ii) Are there plans for adding time zone support?

Thanks, Tom

-- Dr. Thomas Steiner, Employee, Google Inc. http://blog.tomayac.com, http://twitter.com/tomayac -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.12 (GNU/Linux) iFy0uwAntT0bE3xtRa5AfeCheCkthAtTh3reSabiGbl0ck0fjumBl3DCharaCTersAttH3b0ttom.hTtP5://xKcd.c0m/1181/ -----END PGP SIGNATURE-----

Oliver Keyes

8:41 p.m.

By time zone support do you mean localising the server-side timestamps to the client location, and making the data available in a form divided-up like that?

On 13 September 2015 at 04:37, Thomas Steiner tomac@google.com wrote:

...

Hi all,

Two additional questions: (i) Are there plans for making this data available via this API at lower granularity (hourly, or even more fine grained, or even in streaming realtime form)? (ii) Are there plans for adding time zone support?

Thanks, Tom

-- Dr. Thomas Steiner, Employee, Google Inc. http://blog.tomayac.com, http://twitter.com/tomayac

-----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.12 (GNU/Linux)

iFy0uwAntT0bE3xtRa5AfeCheCkthAtTh3reSabiGbl0ck0fjumBl3DCharaCTersAttH3b0ttom.hTtP5://xKcd.c0m/1181/ -----END PGP SIGNATURE-----


-- Oliver Keyes Count Logula Wikimedia Foundation

Thomas Steiner

9:26 p.m.

I mean that somehow I could express getting data in an exact given period of time, say, exactly the day September 11, 2015 in the time zone CET (that day started at 3pm relative to PDT or 11pm relative to UTC). Without time zone support, I would get data “outside” of my desired local time zone. Hope this makes sense and is clear.
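As a rough sketch of what such a request amounts to, the snippet below converts one local calendar day into the UTC interval a server storing UTC timestamps would need to cover. It uses a fixed offset rather than a named time zone, and the function name is made up for illustration.

```python
from datetime import datetime, timedelta, timezone


def local_day_in_utc(year, month, day, utc_offset_hours):
    """Return the (start, end) of a local calendar day as UTC datetimes.

    `utc_offset_hours` is a fixed offset such as 1 (CET) or 2 (CEST);
    daylight-saving transitions are deliberately ignored in this sketch.
    """
    tz = timezone(timedelta(hours=utc_offset_hours))
    start = datetime(year, month, day, tzinfo=tz)
    end = start + timedelta(days=1)
    return start.astimezone(timezone.utc), end.astimezone(timezone.utc)


if __name__ == "__main__":
    for offset in (1, 2):
        start, end = local_day_in_utc(2015, 9, 11, offset)
        print(offset, start.isoformat(), end.isoformat())
```

With a +2 offset (Central European Summer Time, which applies in mid-September) the day starts at 22:00 UTC on September 10, which is 3pm PDT; with a +1 offset (CET proper) it starts at 23:00 UTC.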

Andrew Gray

9:39 p.m.

On 13 September 2015 at 16:26, Thomas Steiner tomac@google.com wrote:

...

I mean that somehow I could express getting data in an exact given period of time, say, exactly the day September 11, 2015 in the time zone CET (that day started at 3pm relative to PDT or 11pm relative to UTC). Without time zone support, I would get data “outside” of my desired local time zone. Hope this makes sense and is clear.

A cautious note on time zones...

If you're holding everything in one hour bins, as we currently do with the aggregated data, then it's easy enough to switch from UTC to CET to EST and so forth.

But not all time zones differ by one hour increments. Most noticeably, India is on UTC+5:30, and a handful of other places also differ by 30 minutes from the standard (or in the case of Nepal, 45). I'm not sure you could display these without regenerating the underlying data, which would be a lot of added complexity.

-- - Andrew Gray andrew.gray@dunelm.org.uk
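The constraint described above can be sketched in a few lines. Assuming views are stored as one count per UTC hour (the dictionary layout and function name are hypothetical), daily totals for another zone can be derived by shifting whole bins, but only when the offset is a whole number of hours.

```python
from collections import defaultdict


def daily_totals(hourly_counts, offset_minutes):
    """Re-bucket hourly UTC counts into local days.

    `hourly_counts` maps an hour index (hours since an arbitrary UTC
    midnight) to a view count. Offsets that are not whole hours are
    rejected, because an hourly bin cannot be split exactly.
    """
    if offset_minutes % 60 != 0:
        raise ValueError(
            "offset is not a whole number of hours; "
            "hourly bins would have to be regenerated at finer granularity"
        )
    offset_hours = offset_minutes // 60
    totals = defaultdict(int)
    for utc_hour, views in hourly_counts.items():
        local_hour = utc_hour + offset_hours
        totals[local_hour // 24] += views  # local day index
    return dict(totals)


if __name__ == "__main__":
    hourly = {h: 1 for h in range(48)}   # two UTC days, one view per hour
    print(daily_totals(hourly, 60))      # UTC+1 works
    try:
        daily_totals(hourly, 330)        # UTC+5:30 does not
    except ValueError as err:
        print("cannot re-bucket:", err)
```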

Joseph Allemandou

14 Sep

6:09 p.m.

Hi all,

My thoughts and opinion around entry-point definition.

While our long-term plan is to provide 'on-the-fly per-query computation', for now we pre-aggregate every dataset we want to serve and store it in Cassandra to be exposed by RESTBase. This means we can't easily provide aggregation over variable start/end ranges.

We could either:

- send every dataset in between the start and end date for a given time granularity level (could be big!), or
- use a '/top/{project}/{access}/{year}/{month}/{day}' entry point, for instance, with the possibility to skip the 'day' parameter to get the full month.
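A minimal sketch of how the second option could be interpreted on the serving side: the path is split into its segments, and a missing 'day' selects the pre-aggregated monthly dataset instead of the daily one. The function name, the returned dictionary, and the "daily"/"monthly" labels are assumptions for illustration only.

```python
import calendar
from datetime import date


def parse_top_path(path):
    """Parse /top/{project}/{access}/{year}/{month}[/{day}] into a lookup spec."""
    parts = path.strip("/").split("/")
    if len(parts) not in (5, 6) or parts[0] != "top":
        raise ValueError(f"unsupported path: {path}")
    project, access = parts[1], parts[2]
    year, month = int(parts[3]), int(parts[4])
    if len(parts) == 6:
        # A day was given: look up the daily pre-aggregated dataset.
        start = end = date(year, month, int(parts[5]))
        granularity = "daily"
    else:
        # No day: cover the full month with the monthly dataset.
        last_day = calendar.monthrange(year, month)[1]
        start, end = date(year, month, 1), date(year, month, last_day)
        granularity = "monthly"
    return {
        "project": project,
        "access": access,
        "granularity": granularity,
        "start": start,
        "end": end,
    }


if __name__ == "__main__":
    print(parse_top_path("/top/en.wikipedia/all-access/2015/09/11"))
    print(parse_top_path("/top/en.wikipedia/all-access/2015/09"))
```

Each URL then maps to exactly one stored dataset, which fits the pre-aggregation constraint described above.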

*@Thomas*:

- As Andrew said, the data we have is pre-aggregated at hour level so far.
- The data is tagged in the UTC timezone, and we planned that requests would use that timezone by default.
- As said in this message, we are thinking of ways to provide better access to data (on-the-fly computation, lower time granularity and others), and this involves both technical and privacy concerns. That will be for the future :)

Joseph

On Sun, Sep 13, 2015 at 5:39 PM, Andrew Gray andrew.gray@dunelm.org.uk wrote:

...

On 13 September 2015 at 16:26, Thomas Steiner tomac@google.com wrote:

...
I mean that somehow I could express getting data in an exact given period of time, say, exactly the day September 11, 2015 in the time zone CET (that day started at 3pm relative to PDT or 11pm relative to UTC). Without time zone support, I would get data “outside” of my desired local time zone. Hope this makes sense and is clear.

A cautious note on time zones...

If you're holding everything in one hour bins, as we currently do with the aggregated data, then it's easy enough to switch from UTC to CET to EST and so forth.

But not all time zones differ by one hour increments. Most noticeably, India is on UTC+5:30, and a handful of other places also differ by 30 minutes from the standard (or in the case of Nepal, 45). I'm not sure you could display these without regenerating the underlying data, which would be a lot of added complexity.

--

Andrew Gray andrew.gray@dunelm.org.uk


-- *Joseph Allemandou* Data Engineer @ Wikimedia Foundation IRC: joal


analytics@lists.wikimedia.org

37 comments

17 participants


participants (17)

Adam Baso
Andrew Gray
Andrew Otto
Dan Andreescu
Erik Bernhardson
Gabriel Wicke
Jonathan Morgan
Joseph Allemandou
Leila Zia
Marcel Ruiz Forns
Marko Obrovac
Oliver Keyes
paul@paulweiss.info
Sage Ross
Thomas Steiner
Timo Tijhof
Toby Negrin