+1 for just making the URI consistent and not supporting too many nice human edge cases
:)
On Sep 15, 2015, at 06:57, Marko Obrovac
<mobrovac(a)wikimedia.org> wrote:
Hello,
Gabriel, Dan and I are discussing this very same topic on T103811 [1,2,3], so please take
a look there and weigh in!
As for the specific endpoints, perhaps it'd be worth switching the places of *top*
and the project name to be more in line with the current public RESTful URI layout?
Also, I must admit I find the non-determinism of the endpoints confusing to some extent.
Specifically, I'm referring to the `/{start}/{end}` portion (or, in your notation, this
should really be `/{start}{/end}` denoting that `{end}` is an optional URI parameter), the
problem being exactly that `{end}` is optional and, if not supplied, the current date is
assumed. That entails that the result of making a request to the endpoint without an end
date (or TS) depends on the context (the context in this case being the time stamp of the
request). So, one day the request encompasses a span of 24h, while the next that same
request refers to a 48h period.
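To make the concern concrete, here is a minimal Python sketch of how a server might resolve a missing `{end}`. The function name `resolve_range` and its explicit `now` argument are hypothetical and exist only for illustration; nothing here is taken from the actual implementation.

```python
from datetime import date


def resolve_range(start, end=None, now=None):
    """Return (start, end); a missing end falls back to the request date."""
    if now is None:
        now = date.today()
    if end is None:
        end = now
    return start, end


start = date(2015, 9, 14)

# The same request (no end date), issued on two consecutive days.
_, end_day1 = resolve_range(start, now=date(2015, 9, 15))
_, end_day2 = resolve_range(start, now=date(2015, 9, 16))

span_day1 = (end_day1 - start).days * 24
span_day2 = (end_day2 - start).days * 24
print(span_day1, span_day2)  # prints "24 48"
```

The URL is identical in both calls; only the hidden `now` input differs, which is what makes the result context-dependent.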
I do agree that this makes it easier for humans to issue requests ("Why would I need
to write down today's date?"), but APIs are meant to be only *human-friendly*,
not *for humans* (yes, there is a difference :P). What I mean is that it should feel
natural for humans to create / programme calls to the API and then use these results in
their applications/presentations/etc. In that context, there is literally no difference
between:
- give me the list of top articles for the past 30 days (this is how the human asks the
question)
- give me the list of top articles starting from 2015-08-15 (for an application,
that's just a matter of computing `current_time() - 1m`)
- give me the list of top articles starting from 2015-08-15 and ending on 2015-09-15
(same as above, plus a call to `current_time()`)
Unless, of course, you target mostly human requests, in which case my argument is
rendered moot :P
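As a rough sketch of the point above, this is how an application might turn "the past month" into an explicit range. The helpers `one_month_before` and `top_path` are hypothetical names, the fixed `today` stands in for `current_time()`, and the path follows the `/top/{project}/{access}/{start}/{end}` layout proposed earlier in this thread.

```python
import calendar
from datetime import date


def one_month_before(d):
    """Return the same day in the previous month, clamped to that month's length."""
    if d.month > 1:
        year, month = d.year, d.month - 1
    else:
        year, month = d.year - 1, 12
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def top_path(project, access, start, end):
    """Build a fully explicit (deterministic) request path."""
    return "/top/{}/{}/{}/{}".format(
        project, access, start.isoformat(), end.isoformat()
    )


today = date(2015, 9, 15)  # stands in for current_time()
path = top_path("en.wikipedia", "all-access", one_month_before(today), today)
print(path)  # prints "/top/en.wikipedia/all-access/2015-08-15/2015-09-15"
```

The client pays a couple of lines of date arithmetic, and in exchange the same URL always refers to the same period.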
My 2 cents,
Marko
[1] https://phabricator.wikimedia.org/T103811
[2] https://phabricator.wikimedia.org/T103811#1639417
[3] https://phabricator.wikimedia.org/T103811#1640977
On 14 September 2015 at 16:53, Dan Andreescu <dandreescu(a)wikimedia.org> wrote:
Thank you all for your thoughtful opinions.
Since people want to know the top pages over an arbitrary time period, we think Druid
would be the best back-end for that kind of query. But we're not going to push that
for the first release. It's very useful to know that's the consensus; we can now
start talking to Jaime Crespo about Druid / alternatives, make plans, etc. Until then,
the first release is going to have the top endpoint that Joseph wrote about. This is easy
to pre-aggregate and dump into Cassandra. Also, the /v1/pageviews/ prefix is going to be
on all the endpoints we launch with, because these are endpoints in a
"pageviews" RESTBase module. So we'll have:
/v1/pageviews/top/{project}/{access}/{year}/{month}/{day}
for now, with {month} and {day} being optional parameters. This will give you the top
pageviews for the selected calendar date. And as soon as we can, we'll have:
/v1/pageviews/top/{project}/{access}/from/{start}{/end}
As proposed by Gabriel, with {start} and {end} taking both full dates and
"now"-relative negative integers.
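A minimal Python sketch of how such parameters could be interpreted and how the optional `{end}` segment could be assembled. Two assumptions are made here that the thread does not settle: negative integers are counted in days, and full dates are written as YYYY-MM-DD. The names `resolve_param` and `range_path` are hypothetical.

```python
from datetime import date, timedelta


def resolve_param(value, now):
    """Turn a {start} or {end} value into a concrete date.

    Assumption: integers are day offsets relative to `now` (zero or
    negative), anything else is a full date in YYYY-MM-DD form.
    """
    try:
        offset = int(value)
    except ValueError:
        return date.fromisoformat(value)
    if offset > 0:
        raise ValueError("relative offsets must not be positive")
    return now + timedelta(days=offset)


def range_path(project, access, start, end=None):
    """Build the path; the {end} segment is appended only when given."""
    path = "/v1/pageviews/top/{}/{}/from/{}".format(project, access, start)
    if end is not None:
        path += "/{}".format(end)
    return path


now = date(2015, 9, 15)
print(resolve_param("-30", now))         # prints "2015-08-16"
print(resolve_param("2015-08-15", now))  # prints "2015-08-15"
print(range_path("en.wikipedia", "all-access", "-30"))
# prints "/v1/pageviews/top/en.wikipedia/all-access/from/-30"
print(range_path("en.wikipedia", "all-access", "2015-08-15", "2015-09-15"))
# prints "/v1/pageviews/top/en.wikipedia/all-access/from/2015-08-15/2015-09-15"
```

Note that a relative value such as `-30` resolves to a different date on each day it is requested, while a full date does not.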
The initial endpoint we launch won't have hourly resolution; that seems like too much
data to pre-aggregate. But we'll see how Druid handles very specific dates (should be
fine) and make that a feature in the second version. We'll have to look into the
privacy implications of short time ranges, like an hour.
On Mon, Sep 14, 2015 at 10:18 AM, Andrew Otto <aotto(a)wikimedia.org> wrote:
Also, maybe top-articles instead of top, to avoid
naming collision in the future?
+1 for prefixing whatever paths you are doing now with something relevant. I sense that
there might be more than just pageview data in the future.
/pageviews/top/…?
> On Sep 11, 2015, at 18:38, Marcel Ruiz Forns <mforns(a)wikimedia.org> wrote:
>
> +1 Adam
>
Also, maybe top-articles instead of top, to avoid
naming collision in the future?
>
> On Sat, Sep 12, 2015 at 12:27 AM, Adam Baso <abaso(a)wikimedia.org> wrote:
> I'd be in favor of both. Maybe with a little tweak to the pathing:
>
> /top/{project}/{access}/days/{days-in-the-past}
>
> /top/{project}/{access}/range/{start}/{end}
>
> with "days" or "range" maybe being earlier in the forward slash
separated spec if it doesn't read well semantically.
>
>
> On Fri, Sep 11, 2015 at 3:14 PM, Dan Andreescu <dandreescu(a)wikimedia.org> wrote:
> It wouldn't be too hard to offer both, but I'm thinking it might be confusing
for a consumer. I think ultimately the decision should be up to the people using this
data, because the use cases are fairly different for each form. If people ask for both,
we'll do both.
>
> Leila, we'd love to have page_ids as well, but we'd have to block the release
on a bigger effort to reliably mirror mediawiki databases in Hadoop for processing, so
we'll probably punt on that for now. But we have plenty of reasons to work on
that sooner rather than later.
>
> On Fri, Sep 11, 2015 at 6:09 PM, Gabriel Wicke <gwicke(a)wikimedia.org> wrote:
> The former might be slightly easier to cache, and can be linked to / pulled in
statically, without a need to dynamically construct a URL. Would it be hard to offer
both?
>
> On Fri, Sep 11, 2015 at 3:06 PM, Leila Zia <leila(a)wikimedia.org> wrote:
> It's getting exciting. :-)
>
> I'd go with choice 2 since it gives more control to the user while offering what
the user can get through choice 1 as well.
>
> Question: will we get page_ids or page_titles or both? It's good to have both.
>
> Leila
>
> On Fri, Sep 11, 2015 at 3:00 PM, Dan Andreescu <dandreescu(a)wikimedia.org> wrote:
> Hi everyone. End of quarter is rapidly approaching and I wanted to ask a quick
question about one of the endpoints we want to push out. We want to let you ask
"what are the top articles" but we're not sure how to structure the URL so
it's most useful to you. Here are the choices:
>
> Choice 1. /top/{project}/{access}/{days-in-the-past}
>
> Example: top articles via all en.wikipedia sites for the past 30 days:
/top/en.wikipedia/all-access/30
>
>
> Choice 2. /top/{project}/{access}/{start}/{end}
>
> Example: top articles via all en.wikipedia sites from June 12th, 2014 to August 30th,
2015: /top/en.wikipedia/all-access/2014-06-12/2015-08-30
>
>
> (in all of those,
>
> * {project} means en.wikipedia, commons.wikimedia, etc.
> * {access} means access method as in desktop, mobile web, mobile app
>
> )
>
> Which do you prefer? Would any other query style be useful?
>
> _______________________________________________
> Analytics mailing list
> Analytics(a)lists.wikimedia.org <mailto:Analytics@lists.wikimedia.org>
>
https://lists.wikimedia.org/mailman/listinfo/analytics
<https://lists.wikimedia.org/mailman/listinfo/analytics>
>
>
> --
> Gabriel Wicke
> Principal Engineer, Wikimedia Foundation
>
>
> --
> Marcel Ruiz Forns
> Analytics Developer
> Wikimedia Foundation
--
Marko Obrovac, PhD
Senior Services Engineer
Wikimedia Foundation