One thing I want to bring up that may not be captured in the endpoint described above (but maybe exists in another endpoint? I haven't been following): within search, we would like to integrate page view statistics into our completion suggestions API.  These indexes will be built once a week by a bulk process. Ideally we would like to be able to send over 100 or so titles and get back the average hourly page views for each page over the past week (a random guess at the exact metric, but something that indicates the relative popularity of the page).

On Tue, Sep 15, 2015 at 6:43 AM, Andrew Otto <aotto@wikimedia.org> wrote:
+1 for just making the URI consistent and not supporting too many nice human edge cases :)


On Sep 15, 2015, at 06:57, Marko Obrovac <mobrovac@wikimedia.org> wrote:

Hello,

Gabriel, Dan and I are discussing this very same topic on T103811~[1,2,3], so please take a look there and weigh in!

As for the specific endpoints, perhaps it'd be worth switching the places of *top* and the project name to be more in line with the current public RESTful URI layout?

Also, I must admit I find the non-determinism of the endpoints confusing to some extent. Specifically I'm referring to the `/{start}/{end}` portion (or, in your notation, this should really be `/{start}{/end}` denoting that `{end}` is an optional URI parameter), the problem being exactly that `{end}` is optional and, if not supplied, the current date is assumed. That entails that the result of making a request to the endpoint without an end date (or TS) depends on the context (the context in this case being the time stamp of the request). So, one day the request encompasses a span of 24h, while the next day that same request refers to a 48h period.
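
To make the concern concrete, here is a minimal sketch of how the covered time span changes when `{end}` is omitted. The function name `covered_days` and the rule "a missing end means the day of the request" are assumptions for illustration only, not the actual implementation.

```python
from datetime import date
from typing import Optional


def covered_days(start: date, end: Optional[date], request_day: date) -> int:
    """Number of days a /{start}{/end} request spans.

    Assumption: a missing end is replaced by the day the request is made.
    """
    effective_end = end if end is not None else request_day
    return (effective_end - start).days


start = date(2015, 9, 14)

# Same URL without an end date, issued on two consecutive days:
print(covered_days(start, None, date(2015, 9, 15)))  # spans 1 day (24h)
print(covered_days(start, None, date(2015, 9, 16)))  # spans 2 days (48h)

# With an explicit end date the span no longer depends on the request day:
print(covered_days(start, date(2015, 9, 15), date(2015, 9, 16)))
```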

I do agree that this makes it easier for humans to issue requests ("Why would I need to write down today's date?"), but APIs are meant to be only *human-friendly*, not *for humans* (yes, there is a difference :P). What I mean is that it should feel natural for humans to create / programme calls to the API and then use these results in their applications/presentations/etc. In that context, there is literally no difference between:

- give me the list of top articles for the past 30 days (this is how the human asks the question)
- give me the list of top articles starting from 2015-08-15 (for an application, that's just a matter of computing `current_time() - 1m`)
- give me the list of top articles starting from 2015-08-15 and ending on 2015-09-15 (idem as above plus a call to `current_time()`)
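
For an application, turning the human question into explicit dates could look roughly like the sketch below. The helper name `explicit_range` is hypothetical, and it uses a fixed 30-day window; the `current_time() - 1m` shorthand above is one calendar month, which is why the dates in the list start on 2015-08-15 rather than 2015-08-16.

```python
from datetime import date, timedelta
from typing import Tuple


def explicit_range(today: date, days_back: int) -> Tuple[str, str]:
    """Turn "the past N days" into explicit ISO start and end dates."""
    if days_back < 0:
        raise ValueError("days_back must not be negative")
    start = today - timedelta(days=days_back)
    return start.isoformat(), today.isoformat()


# "Top articles for the past 30 days", asked on 2015-09-15:
start, end = explicit_range(date(2015, 9, 15), 30)
print(start, end)  # 2015-08-16 2015-09-15
```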

Unless, of course, you target mostly human requests, in which case my argument is rendered moot :P

My 2 cents,
Marko



On 14 September 2015 at 16:53, Dan Andreescu <dandreescu@wikimedia.org> wrote:
Thank you all for your thoughtful opinions.

Since people want to know the top pages over an arbitrary time period, we think Druid would be the best back-end for that kind of query.  But we're not going to push that for the first release.  It's very useful to know that's the consensus, we can now start talking to Jaime Crespo about Druid / alternatives, make plans, etc.  Until then, the first release is going to have the top endpoint that Joseph wrote about.  This is easy to pre-aggregate and dump into Cassandra.  Also, the /v1/pageviews/ prefix is going to be on all the endpoints we launch with, because these are endpoints in a "pageviews" RESTBase module.  So we'll have:

/v1/pageviews/top/{project}/{access}/{year}/{month}/{day}

for now, with {month} and {day} being optional parameters.  This will give you the top pageviews for the selected calendar date.  And as soon as we can, we'll have:

/v1/pageviews/top/{project}/{access}/from/{start}{/end}

As proposed by Gabriel, with {start} and {end} taking both full dates and "now"-relative negative integers.
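
As a rough sketch of how a client might deal with both forms: `daily_top_path` builds the first-release path with the optional `{month}` and `{day}`, and `resolve_token` interprets a `{start}` or `{end}` value as either a full date or a "now"-relative negative integer. Both function names are hypothetical, the zero-padding of month and day is an assumption, and so is the unit of the relative integer (days here); none of that is settled above.

```python
from datetime import date, timedelta
from typing import Optional

PREFIX = "/v1/pageviews/top"


def daily_top_path(project: str, access: str, year: int,
                   month: Optional[int] = None,
                   day: Optional[int] = None) -> str:
    """Build /v1/pageviews/top/{project}/{access}/{year}/{month}/{day}.

    {month} and {day} are optional; zero-padding is an assumption.
    """
    if day is not None and month is None:
        raise ValueError("day requires month")
    parts = [PREFIX, project, access, f"{year:04d}"]
    if month is not None:
        parts.append(f"{month:02d}")
    if day is not None:
        parts.append(f"{day:02d}")
    return "/".join(parts)


def resolve_token(token: str, now: date) -> date:
    """Resolve a {start} or {end} value to a concrete date.

    Accepts a full ISO date or a negative integer relative to `now`.
    Assumption: the integer counts days.
    """
    try:
        offset = int(token)
    except ValueError:
        # Not an integer, so treat it as a full date such as 2015-08-15.
        return date.fromisoformat(token)
    if offset >= 0:
        raise ValueError("relative values must be negative integers")
    return now + timedelta(days=offset)


print(daily_top_path("en.wikipedia", "all-access", 2015, 9))
print(resolve_token("-30", date(2015, 9, 15)))
print(resolve_token("2015-08-15", date(2015, 9, 15)))
```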

The initial endpoint we launch won't have hourly resolution, that seems like too much data to pre-aggregate.  But we'll see how Druid handles very specific dates (should be fine) and make that a feature in the second version.  We'll have to look into the privacy implications of short time ranges, like an hour.



On Mon, Sep 14, 2015 at 10:18 AM, Andrew Otto <aotto@wikimedia.org> wrote:
Also, maybe top-articles instead of top, to avoid naming collision in the future?
+1 for prefixing whatever paths you are doing now with something relevant.  I sense that there might be more than just pageview data in the future.

/pageviews/top/…?




On Sep 11, 2015, at 18:38, Marcel Ruiz Forns <mforns@wikimedia.org> wrote:

+1 Adam

Also, maybe top-articles instead of top, to avoid naming collision in the future?

On Sat, Sep 12, 2015 at 12:27 AM, Adam Baso <abaso@wikimedia.org> wrote:
I'd be in favor of both. Maybe with a little tweak to the pathing:

/top/{project}/{access}/days/{days-in-the-past}

/top/{project}/{access}/range/{start}/{end}

with "days" or "range" maybe being earlier in the forward slash separated spec if it doesn't read well semantically.


On Fri, Sep 11, 2015 at 3:14 PM, Dan Andreescu <dandreescu@wikimedia.org> wrote:
It wouldn't be too hard to offer both, but I'm thinking it might be confusing for a consumer.  I think ultimately the decision should be up to the people using this data, because the use cases are fairly different for each form.  If people ask for both, we'll do both.

Leila, we'd love to have page_ids as well, but we'd have to block the release on a bigger effort to reliably mirror MediaWiki databases in Hadoop for processing, so we'll probably punt on that for now.  But we have more than enough reasons to work on that sooner rather than later.

On Fri, Sep 11, 2015 at 6:09 PM, Gabriel Wicke <gwicke@wikimedia.org> wrote:
The former might be slightly easier to cache, and can be linked to / pulled in statically, without a need to dynamically construct a URL. Would it be hard to offer both?

On Fri, Sep 11, 2015 at 3:06 PM, Leila Zia <leila@wikimedia.org> wrote:
It's getting exciting. :-)

I'd go with choice 2 since it gives more control to the user while offering what the user can get through choice 1 as well.

Question: will we get page_ids or page_titles or both? It's good to have both.

Leila

On Fri, Sep 11, 2015 at 3:00 PM, Dan Andreescu <dandreescu@wikimedia.org> wrote:
Hi everyone.  End of quarter is rapidly approaching and I wanted to ask a quick question about one of the endpoints we want to push out.  We want to let you ask "what are the top articles" but we're not sure how to structure the URL so it's most useful to you.  Here are the choices:

Choice 1. /top/{project}/{access}/{days-in-the-past}

Example: top articles via all en.wikipedia sites for the past 30 days: /top/en.wikipedia/all-access/30


Choice 2. /top/{project}/{access}/{start}/{end}

Example: top articles via all en.wikipedia sites from June 12th, 2014 to August 30th, 2015: /top/en.wikipedia/all-access/2014-06-12/2015-08-30


(in all of those,

* {project} means en.wikipedia, commons.wikimedia, etc.
* {access} means access method as in desktop, mobile web, mobile app

)
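
To compare the two choices side by side, the sketch below builds both URL forms from the same inputs and reproduces the two examples above. The function names `top_by_days` and `top_by_range` are hypothetical and only illustrate the path layouts.

```python
from datetime import date


def top_by_days(project: str, access: str, days_in_the_past: int) -> str:
    """Choice 1: /top/{project}/{access}/{days-in-the-past}"""
    if days_in_the_past <= 0:
        raise ValueError("days_in_the_past must be positive")
    return f"/top/{project}/{access}/{days_in_the_past}"


def top_by_range(project: str, access: str, start: date, end: date) -> str:
    """Choice 2: /top/{project}/{access}/{start}/{end}"""
    if end < start:
        raise ValueError("end must not be before start")
    return f"/top/{project}/{access}/{start.isoformat()}/{end.isoformat()}"


print(top_by_days("en.wikipedia", "all-access", 30))
print(top_by_range("en.wikipedia", "all-access",
                   date(2014, 6, 12), date(2015, 8, 30)))
```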

Which do you prefer?  Would any other query style be useful?

_______________________________________________
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics




--
Gabriel Wicke
Principal Engineer, Wikimedia Foundation


--
Marcel Ruiz Forns
Analytics Developer
Wikimedia Foundation

--
Marko Obrovac, PhD
Senior Services Engineer
Wikimedia Foundation