Thank you all for your thoughtful opinions.
Since people want to know the top pages over an arbitrary time period, we think Druid would be the best back-end for that kind of query. But we're not going to push that for the first release. It's very useful to know that's the consensus: we can now start talking to Jaime Crespo about Druid / alternatives, make plans, etc. Until then, the first release is going to have the top endpoint that Joseph wrote about. This is easy to pre-aggregate and dump into Cassandra. Also, the /v1/pageviews/ prefix is going to be on all the endpoints we launch with, because these are endpoints in a "pageviews" RESTBase module. So we'll have:
/v1/pageviews/top/{project}/{access}/{year}/{month}/{day}
for now, with {month} and {day} being optional parameters. This will give you the top pageviews for the selected calendar date. And as soon as we can, we'll have:
/v1/pageviews/top/{project}/{access}/from/{start}{/end}
As proposed by Gabriel, with {start} and {end} taking both full dates and "now"-relative negative integers.
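A minimal sketch of how {start} and {end} could be resolved into calendar dates, purely for illustration. The thread does not settle the formats, so the following are assumptions: full dates are written YYYY-MM-DD, relative values count days back from today, and a missing {end} means today. The function names resolve_bound and resolve_range are hypothetical.

```python
from datetime import date, timedelta
from typing import Optional, Tuple


def resolve_bound(value: str, today: date) -> date:
    """Turn a {start} or {end} value into a calendar date.

    Assumed formats (not settled in the thread): 'YYYY-MM-DD' for a full
    date, or a negative integer such as '-30' meaning 30 days before today.
    """
    try:
        offset = int(value)
    except ValueError:
        # Not an integer, so treat it as a full ISO date.
        return date.fromisoformat(value)
    if offset >= 0:
        raise ValueError("relative values must be negative integers")
    return today + timedelta(days=offset)


def resolve_range(start: str, end: Optional[str] = None,
                  today: Optional[date] = None) -> Tuple[date, date]:
    """Resolve both bounds; a missing {end} is assumed to mean today."""
    today = today or date.today()
    first = resolve_bound(start, today)
    last = resolve_bound(end, today) if end is not None else today
    if first > last:
        raise ValueError("start must not be after end")
    return first, last
```

With these assumptions, resolve_range("-30") would cover the 30 days up to today, and resolve_range("2014-06-12", "2015-08-30") would cover the explicit range from the original proposal.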
The initial endpoint we launch won't have hourly resolution; that seems like too much data to pre-aggregate. But we'll see how Druid handles very specific dates (should be fine) and make that a feature in the second version. We'll have to look into the privacy implications of short time ranges, like an hour.
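To make the first-release path concrete, here is a rough sketch of how a client might build it with {month} and {day} optional. The helper name top_path is hypothetical, and the zero-padding of month and day is an assumption, since the thread does not specify how they are written.

```python
from typing import Optional


def top_path(project: str, access: str, year: int,
             month: Optional[int] = None, day: Optional[int] = None) -> str:
    """Build /v1/pageviews/top/{project}/{access}/{year}/{month}/{day}.

    {month} and {day} are optional, but a day only makes sense with a month.
    """
    if day is not None and month is None:
        raise ValueError("day requires month")
    parts = ["/v1/pageviews/top", project, access, f"{year:04d}"]
    if month is not None:
        parts.append(f"{month:02d}")  # zero-padding is an assumption
    if day is not None:
        parts.append(f"{day:02d}")    # zero-padding is an assumption
    return "/".join(parts)
```

For example, top_path("en.wikipedia", "all-access", 2015, 9, 14) addresses a single day, while top_path("en.wikipedia", "all-access", 2015) leaves out both optional segments.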
On Mon, Sep 14, 2015 at 10:18 AM, Andrew Otto aotto@wikimedia.org wrote:
Also, maybe *top-articles* instead of *top*, to avoid naming collision in the future?
+1 for prefixing whatever paths you are doing now with something relevant. I sense that there might be more than just pageview data in the future.
/pageviews/top/…?
On Sep 11, 2015, at 18:38, Marcel Ruiz Forns mforns@wikimedia.org wrote:
+1 Adam
Also, maybe *top-articles* instead of *top*, to avoid naming collision in the future?
On Sat, Sep 12, 2015 at 12:27 AM, Adam Baso abaso@wikimedia.org wrote:
I'd be in favor of both. Maybe with a little tweak to the pathing:
/top/{project}/{access}/days/{days-in-the-past}
/top/{project}/{access}/range/{start}/{end}
with "days" or "range" maybe being earlier in the forward slash separated spec if it doesn't read well semantically.
On Fri, Sep 11, 2015 at 3:14 PM, Dan Andreescu dandreescu@wikimedia.org wrote:
It wouldn't be too hard to offer both, but I'm thinking it might be confusing for a consumer. I think ultimately the decision should be up to the people using this data, because the use cases are fairly different for each form. If people ask for both, we'll do both.
Leila, we'd love to have page_ids as well, but we'd have to block the release on a bigger effort to reliably mirror mediawiki databases in Hadoop for processing, so we'll probably punt on that for now. But we have plenty of reasons to work on that sooner rather than later.
On Fri, Sep 11, 2015 at 6:09 PM, Gabriel Wicke gwicke@wikimedia.org wrote:
The former might be slightly easier to cache, and can be linked to / pulled in statically, without a need to dynamically construct a URL. Would it be hard to offer both?
On Fri, Sep 11, 2015 at 3:06 PM, Leila Zia leila@wikimedia.org wrote:
It's getting exciting. :-)
I'd go with choice 2 since it gives more control to the user while offering what the user can get through choice 1 as well.
Question: will we get page_ids or page_titles or both? It's good to have both.
Leila
On Fri, Sep 11, 2015 at 3:00 PM, Dan Andreescu dandreescu@wikimedia.org wrote:
Hi everyone. End of quarter is rapidly approaching and I wanted to ask a quick question about one of the endpoints we want to push out. We want to let you ask "what are the top articles" but we're not sure how to structure the URL so it's most useful to you. Here are the choices:
Choice 1. /top/{project}/{access}/{days-in-the-past}
Example: top articles via all en.wikipedia sites for the past 30 days: /top/en.wikipedia/all-access/30
Choice 2. /top/{project}/{access}/{start}/{end}
Example: top articles via all en.wikipedia sites from June 12th, 2014 to August 30th, 2015: /top/en.wikipedia/all-access/2014-06-12/2015-08-30
(in all of those,
- {project} means en.wikipedia, commons.wikimedia, etc.
- {access} means access method as in desktop, mobile web, mobile app
)
Which do you prefer? Would any other query style be useful?
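As an illustration of how the two choices relate, a Choice 1 request can be rewritten as a Choice 2 path by counting back from today. This is only a sketch: the helper name days_to_range_path is hypothetical, and treating today as the end of the window is an assumption.

```python
from datetime import date, timedelta


def days_to_range_path(project: str, access: str, days: int, today: date) -> str:
    """Express 'the past N days' (Choice 1) as a start/end path (Choice 2)."""
    if days < 0:
        raise ValueError("days must not be negative")
    end = today                          # assumption: the window ends today
    start = today - timedelta(days=days)
    return f"/top/{project}/{access}/{start.isoformat()}/{end.isoformat()}"
```

Under these assumptions, /top/en.wikipedia/all-access/30 requested on 2015-09-11 would correspond to /top/en.wikipedia/all-access/2015-08-12/2015-09-11.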
Analytics mailing list Analytics@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/analytics
-- Gabriel Wicke Principal Engineer, Wikimedia Foundation
-- *Marcel Ruiz Forns* Analytics Developer Wikimedia Foundation