Hi everyone. End of quarter is rapidly approaching and I wanted to ask a quick question about one of the endpoints we want to push out. We want to let you ask "what are the top articles" but we're not sure how to structure the URL so it's most useful to you. Here are the choices:
Choice 1. /top/{project}/{access}/{days-in-the-past}
Example: top articles via all en.wikipedia sites for the past 30 days: /top/en.wikipedia/all-access/30
Choice 2. /top/{project}/{access}/{start}/{end}
Example: top articles via all en.wikipedia sites from June 12th, 2014 to August 30th, 2015: /top/en.wikipedia/all-access/2014-06-12/2015-08-30
(in all of those,
* {project} means en.wikipedia, commons.wikimedia, etc.
* {access} means access method as in desktop, mobile web, mobile app
)
Which do you prefer? Would any other query style be useful?
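To make the two choices concrete, here is a minimal Python sketch of how a client might build each URL. The helper names are hypothetical, no host is assumed, and neither endpoint exists yet; the sketch only mirrors the two path shapes proposed above, with dates written as YYYY-MM-DD as in the example.

```python
from datetime import date


def top_url_days(project: str, access: str, days: int) -> str:
    """Choice 1: /top/{project}/{access}/{days-in-the-past} (hypothetical helper)."""
    if days <= 0:
        raise ValueError("days must be a positive integer")
    return f"/top/{project}/{access}/{days}"


def top_url_range(project: str, access: str, start: date, end: date) -> str:
    """Choice 2: /top/{project}/{access}/{start}/{end} (hypothetical helper)."""
    if end < start:
        raise ValueError("end must not precede start")
    return f"/top/{project}/{access}/{start.isoformat()}/{end.isoformat()}"


# The two examples from the message above.
print(top_url_days("en.wikipedia", "all-access", 30))
print(top_url_range("en.wikipedia", "all-access", date(2014, 6, 12), date(2015, 8, 30)))
```

With choice 1 the URL is a fixed string, whereas with choice 2 the caller has to compute the dates before it can build the path.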
It's getting exciting. :-)
I'd go with choice 2 since it gives more control to the user while offering what the user can get through choice 1 as well.
Question: will we get page_ids or page_titles or both? It's good to have both.
Leila
On Fri, Sep 11, 2015 at 3:00 PM, Dan Andreescu dandreescu@wikimedia.org wrote:
[...]
Analytics mailing list Analytics@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/analytics
The former might be slightly easier to cache, and can be linked to / pulled in statically, without a need to dynamically construct a URL. Would it be hard to offer both?
On Fri, Sep 11, 2015 at 3:06 PM, Leila Zia leila@wikimedia.org wrote:
[...]
It wouldn't be too hard to offer both, but I'm thinking it might be confusing for a consumer. I think ultimately the decision should be up to the people using this data, because the use cases are fairly different for each form. If people ask for both, we'll do both.
Leila, we'd love to have page_ids as well, but we'd have to block the release on a bigger effort to reliably mirror MediaWiki databases in Hadoop for processing, so we'll probably punt on that for now. But we have plenty of reasons to work on that sooner rather than later.
On Fri, Sep 11, 2015 at 6:09 PM, Gabriel Wicke gwicke@wikimedia.org wrote:
[...]
-- Gabriel Wicke Principal Engineer, Wikimedia Foundation
I'd be in favor of both. Maybe with a little tweak to the pathing:
/top/{project}/{access}/days/{days-in-the-past}
/top/{project}/{access}/range/{start}/{end}
with "days" or "range" possibly moving earlier in the slash-separated path if that reads better semantically.
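As a rough illustration of why the extra "days" / "range" segment helps, here is a minimal Python sketch of a dispatcher that tells the two forms apart by keyword rather than by guessing from the number of trailing segments. The function name and the returned dictionary shape are assumptions made for this example only.

```python
from datetime import date


def parse_top_path(path: str) -> dict:
    """Parse /top/{project}/{access}/days/{n} or /top/{project}/{access}/range/{start}/{end}."""
    parts = path.strip("/").split("/")
    if len(parts) < 4 or parts[0] != "top":
        raise ValueError(f"not a top path: {path}")
    project, access, kind, args = parts[1], parts[2], parts[3], parts[4:]
    if kind == "days" and len(args) == 1:
        return {"project": project, "access": access, "days": int(args[0])}
    if kind == "range" and len(args) == 2:
        # Dates are assumed to be ISO formatted (YYYY-MM-DD), as in the examples.
        start, end = (date.fromisoformat(a) for a in args)
        return {"project": project, "access": access, "start": start, "end": end}
    raise ValueError(f"unrecognised top path: {path}")


print(parse_top_path("/top/en.wikipedia/all-access/days/30"))
print(parse_top_path("/top/en.wikipedia/all-access/range/2014-06-12/2015-08-30"))
```

Because the keyword names the query style explicitly, new styles could be added later without changing how the existing two are recognised.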
On Fri, Sep 11, 2015 at 3:14 PM, Dan Andreescu dandreescu@wikimedia.org wrote:
[...]
+1 Adam
Also, maybe *top-articles* instead of *top*, to avoid naming collision in the future?
On Sat, Sep 12, 2015 at 12:27 AM, Adam Baso abaso@wikimedia.org wrote:
[...]
> Also, maybe top-articles instead of top, to avoid naming collision in the future?
+1 for prefixing whatever paths you are doing now with something relevant. I sense that there might be more than just pageview data in the future.
/pageviews/top/…?
On Sep 11, 2015, at 18:38, Marcel Ruiz Forns mforns@wikimedia.org wrote:
[...]
-- Marcel Ruiz Forns Analytics Developer Wikimedia Foundation
Thank you all for your thoughtful opinions.
Since people want to know the top pages over an arbitrary time period, we think Druid would be the best back-end for that kind of query, but we're not going to push that for the first release. It's very useful to know that's the consensus: we can now start talking to Jaime Crespo about Druid / alternatives, make plans, etc.
Until then, the first release is going to have the top endpoint that Joseph wrote about, which is easy to pre-aggregate and dump into Cassandra. Also, the /v1/pageviews/ prefix is going to be on all the endpoints we launch with, because these are endpoints in a "pageviews" RESTBase module. So we'll have:
/v1/pageviews/top/{project}/{access}/{year}/{month}/{day}
for now, with {month} and {day} being optional parameters. This will give you the top pageviews for the selected calendar date. And as soon as we can, we'll have:
/v1/pageviews/top/{project}/{access}/from/{start}{/end}
as proposed by Gabriel, with {start} and {end} accepting either full dates or "now"-relative negative integers.
The initial endpoint we launch won't have hourly resolution; that seems like too much data to pre-aggregate. But we'll see how Druid handles very specific dates (should be fine) and make that a feature in the second version. We'll have to look into the privacy implications of short time ranges, like an hour.
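To illustrate the two path shapes described above, here is a minimal Python sketch. It is not the RESTBase implementation. The helper names, the zero-padding of {month} and {day}, the YYYY-MM-DD date format, and the reading of a negative integer as a number of days relative to today are all assumptions made for this example.

```python
from datetime import date, timedelta
from typing import Optional


def top_by_date_url(project: str, access: str, year: int,
                    month: Optional[int] = None, day: Optional[int] = None) -> str:
    """Build /v1/pageviews/top/{project}/{access}/{year}/{month}/{day}; month and day are optional."""
    if day is not None and month is None:
        raise ValueError("day requires month")
    parts = ["/v1/pageviews/top", project, access, f"{year:04d}"]
    if month is not None:
        parts.append(f"{month:02d}")  # zero-padding is an assumption
    if day is not None:
        parts.append(f"{day:02d}")
    return "/".join(parts)


def resolve_bound(value: str, today: date) -> date:
    """Resolve a {start} or {end} value: a full date or a "now"-relative negative integer."""
    try:
        offset = int(value)
    except ValueError:
        # Not an integer, so treat it as a full date (assumed YYYY-MM-DD).
        return date.fromisoformat(value)
    if offset > 0:
        raise ValueError("relative offsets must be zero or negative")
    return today + timedelta(days=offset)  # unit of days is an assumption


print(top_by_date_url("en.wikipedia", "all-access", 2015, 9, 14))
print(resolve_bound("-30", date(2015, 9, 14)))
```

Passing `today` in explicitly keeps the resolution step testable, since the result of a relative bound otherwise depends on when the request is made.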
On Mon, Sep 14, 2015 at 10:18 AM, Andrew Otto aotto@wikimedia.org wrote:
[...]
Hello,
Gabriel, Dan and I are discussing this very same topic on T103811 [1,2,3], so please take a look there and weigh in!
As for the specific endpoints, perhaps it'd be worth switching the places of *top* and the project name to be more in line with the current public RESTful URI layout?
Also, I must admit I find the non-determinism of the endpoints somewhat confusing. Specifically, I'm referring to the `/{start}/{end}` portion (or, in your notation, `/{start}{/end}`, denoting that `{end}` is an optional URI parameter). The problem is exactly that `{end}` is optional and, if not supplied, the current date is assumed. This means that the result of a request without an end date (or timestamp) depends on the context, in this case the time stamp of the request. So one day the request covers a span of 24h, while the next day that same request covers a 48h period.
I do agree that this makes it easier for humans to issue requests ("Why would I need to write down today's date?"), but APIs are meant to be only *human-friendly*, not *for humans* (yes, there is a difference :P). What I mean is that it should feel natural for humans to create / programme calls to the API and then use these results in their applications/presentations/etc. In that context, there is literally no difference between:
- give me the list of top articles for the past 30 days (this is how the human asks the question)
- give me the list of top articles starting from 2015-08-15 (for an application, that's just a matter of computing `current_time() - 1m`)
- give me the list of top articles starting from 2015-08-15 and ending on 2015-09-15 (idem as above plus a call to `current_time()`)
Unless, of course, you target mostly human requests, in which case my argument is rendered moot :P
My 2 cents, Marko
[1] https://phabricator.wikimedia.org/T103811
[2] https://phabricator.wikimedia.org/T103811#1639417
[3] https://phabricator.wikimedia.org/T103811#1640977
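The point about determinism can be sketched in a few lines of Python: the application computes both dates itself and always sends an explicit start and end, so the same URL always denotes the same period. The helper name is hypothetical, and the path reuses the /from/{start}{/end} form proposed earlier in the thread with an assumed YYYY-MM-DD date format.

```python
from datetime import date, timedelta


def pinned_range_path(project: str, access: str, days_back: int, today: date) -> str:
    """Build a path with both bounds spelled out, so its meaning does not drift over time."""
    start = today - timedelta(days=days_back)
    return (f"/v1/pageviews/top/{project}/{access}/from/"
            f"{start.isoformat()}/{today.isoformat()}")


# "The past month" as asked on 2015-09-15, pinned to concrete dates.
print(pinned_range_path("en.wikipedia", "all-access", 31, date(2015, 9, 15)))
```

A URL built this way can be stored or shared and will still refer to the same period the next day, which is not true of a request that omits the end date.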
On 14 September 2015 at 16:53, Dan Andreescu dandreescu@wikimedia.org wrote:
[...]
+1 for just making the URI consistent and not supporting too many nice human edge cases :)
On Sep 15, 2015, at 06:57, Marko Obrovac mobrovac@wikimedia.org wrote:
Hello,
Gabriel, Dan and I are discussing this very same topic on T103811~[1,2,3], so please take a look there and weigh in!
As for the specific endpoints, perhaps it'd be worth switching the places of *top* and the project name to be more in line with the current public RESTful URI layout?
Also, I must admit I find the non-determinism of the endpoints confusing to some extent. Specifically I'm referring to the `/{start}/{end}` portion (or, in your notion, this should really be `/{start}{/end}` denoting that `{end}` is an optional URI parameter), the problem being exactly that `{end}` is optional and, if not supplied, the current date is assumed. That entails that the result of making a request to the endpoint without an end date (or TS) depends on the context (the context in this case being the time stamp of the request). So, one day the request encompasses a span of 24h, while the next that same request refers to a 48h period.
I do agree that this makes it easier for humans to issue requests ("Why would I need to write down today's date?"), but APIs are meant to be only *human-friendly*, not *for humans* (yes, there is a difference :P). What I mean is that it should feel natural for humans to create / programme calls to the API and then use these results in their applications/presentations/etc. In that context, there is literally no difference between:
- give me the list of top articles for the past 30 days (this is how the human asks the question)
- give me the list of top articles starting from 2015-08-15 (for an application, that's just a matter of computing `current_time() - 1m`)
- give me the list of top articles starting form 2015-08-15 and ending on 2015-09-15 (idem as above plus a call to `current_time()`)
Unless, of course, you target mostly human requests, in which case my argument is rendered moot :P
My 2 cents, Marko
[1] https://phabricator.wikimedia.org/T103811 https://phabricator.wikimedia.org/T103811 [2] https://phabricator.wikimedia.org/T103811#1639417 https://phabricator.wikimedia.org/T103811#1639417 [3] https://phabricator.wikimedia.org/T103811#1640977 https://phabricator.wikimedia.org/T103811#1640977
On 14 September 2015 at 16:53, Dan Andreescu <dandreescu@wikimedia.org mailto:dandreescu@wikimedia.org> wrote: Thank you all for your thoughtful opinions.
Since people want to know the top pages over an arbitrary time period, we think Druid would be the best back-end for that kind of query. But we're not going to push that for the first release. It's very useful to know that's the consensus, we can now start talking to Jaime Crespo about Druid / alternatives, make plans, etc. Until then, the first release is going to have the top endpoint that Joseph wrote about. This is easy to pre-aggregate and dump into Cassandra. Also, the /v1/pageviews/ prefix is going to be on all the endpoints we launch with, because these are endpoints in a "pageviews" RESTBase module. So we'll have:
/v1/pageviews/top/{project}/{access}/{year}/{month}/{day}
for now, with {month} and {day} being optional parameters. This will give you the top pageviews for the selected calendar date. And as soon as we can, we'll have:
/v1/pageviews/top/{project}/{access}/from/{start}{/end}
As proposed by Gabriel, with {start} and {end} taking both full dates and "now"-relative negative integers.
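As an illustration of the two path shapes just described, here is a small sketch that builds them. The function names are hypothetical. Zero-padding of month and day, and writing a "now"-relative value as a bare negative integer, are assumptions, since the thread does not specify the exact encoding.

```python
def top_by_date(project, access, year, month=None, day=None):
    """Path for the first-release form; {month} and {day} are optional."""
    if day is not None and month is None:
        raise ValueError("day requires month")
    parts = ["/v1/pageviews/top", project, access, f"{year:04d}"]
    if month is not None:
        parts.append(f"{month:02d}")  # zero-padding is an assumption
    if day is not None:
        parts.append(f"{day:02d}")
    return "/".join(parts)

def top_from(project, access, start, end=None):
    """Path for the later form; {end} is optional, as in {start}{/end}."""
    path = "/".join(["/v1/pageviews/top", project, access, "from", str(start)])
    return path if end is None else f"{path}/{end}"

print(top_by_date("en.wikipedia", "all-access", 2015, 9, 14))
# /v1/pageviews/top/en.wikipedia/all-access/2015/09/14
print(top_from("en.wikipedia", "all-access", "2014-06-12", "2015-08-30"))
# /v1/pageviews/top/en.wikipedia/all-access/from/2014-06-12/2015-08-30
print(top_from("en.wikipedia", "all-access", -30))
# /v1/pageviews/top/en.wikipedia/all-access/from/-30
```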
The initial endpoint we launch won't have hourly resolution; that seems like too much data to pre-aggregate. But we'll see how Druid handles very specific dates (should be fine) and make that a feature in the second version. We'll have to look into the privacy implications of short time ranges, like an hour.
On Mon, Sep 14, 2015 at 10:18 AM, Andrew Otto <aotto@wikimedia.org> wrote:
Also, maybe top-articles instead of top, to avoid naming collision in the future?
+1 for prefixing whatever paths you are doing now with something relevant. I sense that there might be more than just pageview data in the future.
/pageviews/top/…?
On Sep 11, 2015, at 18:38, Marcel Ruiz Forns <mforns@wikimedia.org> wrote:
+1 Adam
Also, maybe top-articles instead of top, to avoid naming collision in the future?
On Sat, Sep 12, 2015 at 12:27 AM, Adam Baso <abaso@wikimedia.org> wrote:
I'd be in favor of both. Maybe with a little tweak to the pathing:
/top/{project}/{access}/days/{days-in-the-past}
/top/{project}/{access}/range/{start}/{end}
with "days" or "range" maybe being earlier in the forward slash separated spec if it doesn't read well semantically.
On Fri, Sep 11, 2015 at 3:14 PM, Dan Andreescu <dandreescu@wikimedia.org> wrote:
It wouldn't be too hard to offer both, but I'm thinking it might be confusing for a consumer. I think ultimately the decision should be up to the people using this data, because the use cases are fairly different for each form. If people ask for both, we'll do both.
Leila, we'd love to have page_ids as well, but we'd have to block the release on a bigger effort to reliably mirror mediawiki databases in Hadoop for processing, so we'll probably punt on that for now. But we have more than enough reasons to work on that sooner rather than later.
On Fri, Sep 11, 2015 at 6:09 PM, Gabriel Wicke <gwicke@wikimedia.org> wrote:
The former might be slightly easier to cache, and can be linked to / pulled in statically, without a need to dynamically construct a URL. Would it be hard to offer both?
-- Gabriel Wicke Principal Engineer, Wikimedia Foundation
-- Marcel Ruiz Forns Analytics Developer Wikimedia Foundation
-- Marko Obrovac, PhD Senior Services Engineer Wikimedia Foundation
One thing that I want to bring up that may not be captured in the endpoint described above (but maybe exists in another endpoint? I haven't been following). Within search we would like to integrate page view statistics into our completion suggestions API. These indexes will be built once a week from a bulk process. Ideally we would like to be able to send over 100 or so titles and get average hourly page views (random guess on exactly which, but something that indicates the relative popularity of the page) for the past week.
On Tue, Sep 15, 2015 at 6:43 AM, Andrew Otto aotto@wikimedia.org wrote:
+1 for just making the URI consistent and not supporting too many nice human edge cases :)
On Sep 15, 2015, at 06:57, Marko Obrovac mobrovac@wikimedia.org wrote:
Hello,
Gabriel, Dan and I are discussing this very same topic on T103811~[1,2,3], so please take a look there and weigh in!
As for the specific endpoints, perhaps it'd be worth switching the places of *top* and the project name to be more in line with the current public RESTful URI layout?
Also, I must admit I find the non-determinism of the endpoints confusing to some extent. Specifically I'm referring to the `/{start}/{end}` portion (or, in your notation, this should really be `/{start}{/end}` denoting that `{end}` is an optional URI parameter), the problem being exactly that `{end}` is optional and, if not supplied, the current date is assumed. That entails that the result of making a request to the endpoint without an end date (or TS) depends on the context (the context in this case being the time stamp of the request). So, one day the request encompasses a span of 24h, while the next that same request refers to a 48h period.
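The drift described here can be shown in a few lines. This is only a sketch of the behaviour as stated in the message (a missing end defaults to the current date). `resolve_range` is a hypothetical name and the dates are made up.

```python
from datetime import date

def resolve_range(start, end=None, today=None):
    """Resolve an optional end date: if none is supplied, use the current date.

    Returns (start, end, span_in_days).
    """
    today = today or date.today()
    end = end or today
    return start, end, (end - start).days

start = date(2015, 9, 14)
# The same open-ended request, issued on two consecutive days:
print(resolve_range(start, today=date(2015, 9, 15))[2])  # 1
print(resolve_range(start, today=date(2015, 9, 16))[2])  # 2
# With an explicit end, the answer no longer depends on when it is asked:
print(resolve_range(start, date(2015, 9, 15), today=date(2015, 9, 16))[2])  # 1
```

A one-day span corresponds to the 24h case and a two-day span to the 48h case mentioned above.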
One thing that I want to bring up that may not be captured in the endpoint described above (but maybe exists in another endpoint? I haven't been following). Within search we would like to integrate page view statistics into our completion suggestions API. These indexes will be built once a week from a bulk process. Ideally we would like to be able to send over 100 or so titles and get average hourly page views (random guess on exactly which, but something that indicates the relative popularity of the page) for the past week.
This is possible using our "per-article" endpoint. There you can get hourly pageviews for a particular page title for an arbitrary time range. So grab a week of data for each of the 100 titles, then you can average or do whatever you like.
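A rough sketch of the "grab a week of data for each title, then average" step follows. The thread does not give the request or response shape of the per-article endpoint, so the fetch is left as a caller-supplied function. `average_hourly_views` and `fetch_hourly` are hypothetical names.

```python
def average_hourly_views(titles, fetch_hourly):
    """Map each title to its mean hourly view count.

    `fetch_hourly(title)` is a caller-supplied function (hypothetical here)
    that returns the list of hourly counts for the week being scored.
    Titles with no data get 0.0.
    """
    result = {}
    for title in titles:
        counts = fetch_hourly(title)
        result[title] = sum(counts) / len(counts) if counts else 0.0
    return result

# Stand-in data instead of real API calls:
fake = {"Foo": [10, 20, 30], "Bar": []}
print(average_hourly_views(["Foo", "Bar"], lambda t: fake[t]))
# {'Foo': 20.0, 'Bar': 0.0}
```

Keeping the fetch separate means the same averaging code works whether the counts come from the API or from a bulk export.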
I worry a little bit about the performance without having a batch api, but we can certainly try it out and see what happens. Basically we will be requesting the page view information for every NS_MAIN article in every wiki once a week. A quick sum against our search cluster suggests this is ~96 million api requests.
Oh, sorry, I thought you meant you were just querying 100 or so titles! In the case of huge queries like these, you should just query the wmf.pageview_hourly table directly. You can do so with plain SQL via Hive or maybe Impala if we end up setting that up. But those queries should be really fast in that table. We can help you write the query if you send us an attempt and a spec of exactly what you need.
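As a starting point for such an attempt, a query along these lines might work. Only the table name `wmf.pageview_hourly` comes from the thread. The column names (`project`, `page_title`, `view_count`, `year`, `month`, `day`) are assumptions to check against the actual schema, and the helper only handles a week that falls within a single month.

```python
def weekly_views_query(year, month, first_day, last_day):
    """Build a HiveQL string summing views per page over part of one month.

    Column names are assumed, not taken from the thread; verify them
    against the real table before running.
    """
    return f"""
SELECT project, page_title, SUM(view_count) AS views
FROM wmf.pageview_hourly
WHERE year = {int(year)} AND month = {int(month)}
  AND day BETWEEN {int(first_day)} AND {int(last_day)}
GROUP BY project, page_title
""".strip()

print(weekly_views_query(2015, 9, 7, 13))
```

Dividing `views` by the number of hours in the window would give the average hourly figure asked for earlier in the thread.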
On 15 September 2015 at 19:37, Dan Andreescu dandreescu@wikimedia.org wrote:
I worry a little bit about the performance without having a batch api, but
we can certainly try it out and see what happens. Basically we will be requesting the page view information for every NS_MAIN article in every wiki once a week. A quick sum against our search cluster suggests this is ~96 million api requests.
96m equals approx 160 req/s which is more than sustainable for RESTBase.
Oh, sorry, I thought you meant you were just querying 100 or so titles! In the case of huge queries like these, you should just query the wmf.pageview_hourly table directly. You can do so with plain SQL via Hive or maybe Impala if we end up setting that up. But those queries should be really fast in that table. We can help you write the query if you send us an attempt and a spec of exactly what you need.
My performance-oriented nature would also think about something like that, but I think this is not a decision that is to be taken lightly. While having an API doesn't come for free, its beauty lies in the abstraction. Concretely, as a pageview client, I am aware of the "contract" between the service and myself and as such, I trust it to fulfil its part of the job. How it does it is completely irrelevant to me, thus giving me the opportunity to focus on "my part of the job" (no need for me to worry about the internals of the implementation).
That said, it is clear as day that making 100 requests versus making one batch request takes more time. However, on the one hand, it sounds like Erik's use case is not latency- (or time-) sensitive. On the other, given the nature of the pageview API, the cost of computing the result (in case it is not available right away) dwarfs any connection or other related overheads.
Cheers, Marko
On Tue, Sep 15, 2015 at 6:56 PM, Marko Obrovac mobrovac@wikimedia.org wrote:
96m equals approx 160 req/s which is more than sustainable for RESTBase.
True, if we distributed the load over the whole week, but I think Erik needs the results to be available weekly, as in, probably within a day or so of issuing the request. Of course, if we were to serve this kind of request from the API, we would make a better batch-query endpoint for his use case. But I think it might be hard to make that useful generally. I think for now, let's just collect these one-off pageview querying use cases and slowly build them into the API when we can generalize two or more of them into one endpoint.
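The arithmetic behind both readings of the ~96 million figure is easy to check. This sketch only assumes the load is spread evenly over the chosen window.

```python
REQUESTS = 96_000_000          # ~96 million, Erik's estimate
SECONDS_PER_DAY = 24 * 60 * 60

def required_rate(requests, days):
    """Average requests per second needed to finish within `days` days."""
    return requests / (days * SECONDS_PER_DAY)

print(round(required_rate(REQUESTS, 7)))  # 159
print(round(required_rate(REQUESTS, 1)))  # 1111
```

Spread over a week, the rate is close to the quoted 160 req/s. Squeezed into a single day, it is seven times higher, which is the concern raised here.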
Makes sense. We will indeed be doing a batch process once a week to build the completion indices, which ideally will run through all the wikis in a day. We are going to do some analysis into how up to date our page view data really needs to be for scoring purposes, though: if we can get good scoring results while only updating page view info when a page is edited, we might be able to spread out the load across time that way and just hit the page view API once for each edit. Otherwise I'm sure we can do as suggested earlier and pull the data from Hive directly and stuff it into a temporary structure we can query while building the completion indices.
Do you think that temporary structure might be useful to others? If so, we could add that as a data source, and add an endpoint to query it. Either way, happy to help with the query / temp structure.
@Erik: Reading this thread makes me think that it might be interesting to have a chat about using Hadoop for indexing (https://github.com/elastic/elasticsearch-hadoop). I have no idea how you currently index, but I'd love to learn :) Please let me know if you think it could be useful!
Joseph
Hadoop was originally built for indexing the web by processing the web map and exporting indexes to serving systems. I think integration with Elasticsearch would work well.
-Toby
-- *Joseph Allemandou* Data Engineer @ Wikimedia Foundation IRC: joal
Hello,
Just a small note which I don't think has been voiced thus far. There will actually be two APIs - one exposed by the Analytics' RESTBase instance, which will be accessible only from inside of WMF's infrastructure, and another, public-facing one (exposed by the Services' RESTBase instance).
Now, these may be identical (both in layout and functionality) or may (slightly) differ. Which way to go? The big pro of them being identical is that the client wouldn't need to care which RESTBase instance it is actually contacting. That would also ease API maintenance. On the down side, that increases the overhead for Analytics to keep their domain list in sync.
Having a more specialised API for the Analytics instance, on the other hand, would allow us to tailor it more for real internal use cases instead of focusing on the overall API coherence (which we need to do for the public-facing API). I'd honestly vote for that option.
On 16 September 2015 at 16:06, Toby Negrin tnegrin@wikimedia.org wrote:
Hadoop was originally built for indexing the web by processing the web map and exporting indexes to serving systems. I think integration with Elasticsearch would work well.
Right, both are indexing systems (so to speak), but the former is for offline use, while the latter targets online use. Ideally, we should make them cooperate to get the best out of both worlds.
Cheers, Marko
On 22 September 2015 at 05:10, Marko Obrovac mobrovac@wikimedia.org wrote:
Hello,
Just a small note which I don't think has been voiced thus far. There will actually be two APIs - one exposed by the Analytics' RESTBase instance, which will be accessible only from inside of WMF's infrastructure, and another, public-facing one (exposed by the Services' RESTBase instance).
Now, these may be identical (both in layout and functionality) or may (slightly) differ. Which way to go? The big pro of them being identical is that the client wouldn't need to care which RESTBase instance it is actually contacting. That would also ease API maintenance. On the down side, that increases the overhead for Analytics to keep their domain list in sync.
Having a more specialised API for the Analytics instance, on the other hand, would allow us to tailor it more for real internal use cases instead of focusing on the overall API coherence (which we need to do for the public-facing API). I'd honestly vote for that option.
Can you give an example of internal-facing use cases you don't see a broader population of consumers being interested in?
On 16 September 2015 at 16:06, Toby Negrin tnegrin@wikimedia.org wrote:
Hadoop was originally built for indexing the web by processing the web map and exporting indexes to serving systems. I think integration with Elastic Search would work well.
Right, both are indexing systems (so to speak), but the former is for offline use, while the latter targets online use. Ideally, we should make them cooperate to get the best out of both worlds.
Cheers, Marko
-Toby
On Wed, Sep 16, 2015 at 7:03 AM, Joseph Allemandou jallemandou@wikimedia.org wrote:
@Erik: Reading this thread makes me think that it might be interesting to have a chat around using hadoop for indexing (https://github.com/elastic/elasticsearch-hadoop). I have no idea how you currently index, but I'd love to learn :) Please let me know if you think it could be useful ! Joseph
-- Marko Obrovac, PhD Senior Services Engineer Wikimedia Foundation
On 22 September 2015 at 14:40, Oliver Keyes okeyes@wikimedia.org wrote:
On 22 September 2015 at 05:10, Marko Obrovac mobrovac@wikimedia.org wrote:
Hello,
Just a small note which I don't think has been voiced thus far. There will actually be two APIs - one exposed by the Analytics' RESTBase instance, which will be accessible only from inside of WMF's infrastructure, and another, public-facing one (exposed by the Services' RESTBase instance).
Now, these may be identical (both in layout and functionality) or may (slightly) differ. Which way to go? The big pro of them being identical is that the client wouldn't need to care which RESTBase instance it is actually contacting. That would also ease API maintenance. On the down side, that increases the overhead for Analytics to keep their domain list in sync.
Having a more specialised API for the Analytics instance, on the other hand, would allow us to tailor it more for real internal use cases instead of focusing on the overall API coherence (which we need to do for the public-facing API). I'd honestly vote for that option.
Can you give an example of internal-facing use cases you don't see a broader population of consumers being interested in?
In my mail I was mostly hinting at the fact that the public-facing API is divided by domains, whilst the notion of projects is better suited for Analytics. So the internal API could be organised around projects while still supporting domains but in a looser format than the public one.
We plan to support arbitrary projects (such as en-all, all-wiktionary, etc) on the public side as well, but because of the current layout, the analytics' (public) API will be fragmented. There is no need to do such a thing with the internal API too.
To concretely answer the question, I am not aware of any specific use case. Just pointing out that internal users can, if they need/want, rely on projects rather than on domains.
Cheers, Marko
Thanks for your thoughts, Marko.
As for the specific endpoints, perhaps it'd be worth switching the places of *top* and the project name to be more in line with the current public RESTful URI layout?
We can talk about this as part of the phab tasks you linked to. If we keep the project as one of the parameters, then it makes sense for it to be further down in the URI, where the other parameters are. "top" in that case would just be the type of query, not a parameter. If we use the domain in place of the project, which we can only do in non-aggregate cases, then that would be in line with the other public RESTful URIs.
Also, I must admit I find the non-determinism of the endpoints confusing to some extent. Specifically I'm referring to the `/{start}/{end}` portion (or, in your notation, this should really be `/{start}{/end}` denoting that `{end}` is an optional URI parameter), the problem being exactly that `{end}` is optional and, if not supplied, the current date is assumed. That entails that the result of making a request to the endpoint without an end date (or TS) depends on the context (the context in this case being the time stamp of the request). So, one day the request encompasses a span of 24h, while the next that same request refers to a 48h period.
[... snip good points ...]
Hm, I think I'm leaning towards Marko and Andrew's point of view. Relative-valued parameters seem to make caching confusing to think about too. Like start=-30 and end=-1 would have to be evicted precisely and surely at midnight, but which timezone? :)
Ok, so we'll go with absolute-valued deterministic parameters (our current code has only these types of parameters). Maybe if Druid serves as our data store, we can think about this again and see if we can provide a human-friendly interface too.
Unless, of course, you target mostly human requests, in which case my argument is rendered moot :P
I think it's pretty mixed actually, we have asks from humans, bot-writers, analysts, teams at WMF, etc.
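The determinism point in this message can be illustrated with a small sketch. Everything here is hypothetical (the helper names `top_path` and `resolve_open_end` are made up for illustration and are not the API's code); it only contrasts a path with two absolute dates against a path whose `{end}` is filled in from the request time.

```python
from datetime import date
from typing import Tuple


def top_path(project: str, access: str, start: date, end: date) -> str:
    """Build a Choice 2 style path. Both dates are required, so the same
    path always names the same time span."""
    if end < start:
        raise ValueError("end must not be before start")
    return f"/top/{project}/{access}/{start.isoformat()}/{end.isoformat()}"


def resolve_open_end(start: date, today: date) -> Tuple[date, date]:
    """Hypothetical reading of `/{start}{/end}` when `end` is omitted:
    the current date is assumed, so the span depends on the request day."""
    return start, today


# The same open-ended request, issued on two consecutive days.
start = date(2015, 9, 21)
first = resolve_open_end(start, today=date(2015, 9, 22))
second = resolve_open_end(start, today=date(2015, 9, 23))
print((first[1] - first[0]).days, (second[1] - second[0]).days)

# The absolute form names one fixed span, whenever it is requested.
print(top_path("en.wikipedia", "all-access", date(2014, 6, 12), date(2015, 8, 30)))
```

The open-ended request covers one day on the first call and two on the next, while the absolute path is the same string for the same data, which is what makes it easy to cache and to link to.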
I concur with Leila.
Paul
--------- Original Message --------- Subject: Re: [Analytics] [Survey] Pageview API From: "Leila Zia" leila@wikimedia.org Date: 9/11/15 3:06 pm To: "A mailing list for the Analytics Team at WMF and everybody who has an interest in Wikipedia and analytics." analytics@lists.wikimedia.org
It's getting exciting. :-)
I'd go with choice 2 since it gives more control to the user while offering what the user can get through choice 1 as well.
Question: will we get page_ids or page_titles or both? It's good to have both.
Leila
Another option would be a single entry point
/top/{project}/{access}/from/{start}{/end}
with support for negative indexes for 'days in the past':
/top/{project}/{access}/from/-30 /top/{project}/{access}/from/-60/-30
as well as full dates:
/top/en.wikipedia/all-access/2014-06-12/2015-08-30
On Fri, Sep 11, 2015 at 4:26 PM, Gabriel Wicke gwicke@wikimedia.org wrote:
Another option would be a single entry point
/top/{project}/{access}/from/{start}{/end}
with support for negative indexes for 'days in the past':
/top/{project}/{access}/from/-30 /top/{project}/{access}/from/-60/-30
as well as full dates:
/top/en.wikipedia/all-access/2014-06-12/2015-08-30
Correction:
/top/en.wikipedia/all-access/from/2014-06-12/2015-08-30
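As a rough sketch of how the `from/{start}{/end}` segments proposed here could be resolved into concrete dates: the function names are hypothetical, and treating an omitted `{end}` as "today" is an assumption taken from how the optional parameter is described elsewhere in this thread.

```python
import re
from datetime import date, timedelta
from typing import Optional, Tuple

# A negative integer such as "-30" means "30 days in the past".
_RELATIVE = re.compile(r"^-\d+$")


def _resolve(value: str, today: date) -> date:
    """Turn one path segment into a date: either a negative day offset
    relative to `today`, or a full YYYY-MM-DD date."""
    if _RELATIVE.match(value):
        return today + timedelta(days=int(value))
    return date.fromisoformat(value)


def resolve_range(start: str, end: Optional[str], today: date) -> Tuple[date, date]:
    """Resolve `from/{start}{/end}`; a missing end is assumed to mean today."""
    first = _resolve(start, today)
    last = _resolve(end, today) if end is not None else today
    if last < first:
        raise ValueError("end must not be before start")
    return first, last


today = date(2015, 9, 11)
print(resolve_range("-30", None, today))           # /from/-30
print(resolve_range("-60", "-30", today))          # /from/-60/-30
print(resolve_range("2014-06-12", "2015-08-30", today))  # full dates
```

Only the last form resolves to the same range regardless of when the request is made; the two relative forms shift by a day every day.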
-- Gabriel Wicke Principal Engineer, Wikimedia Foundation
Big +1 to Adam. Is the top articles the first deliverable we should expect?
I would prefer to have both days-in-past and start/end daterange options, along the lines of Adam's proposal.
But if I have to choose one, I concur with Leila. Start/end daterange offers more functionality.
Jonathan
And, per Tgr on Phabricator, option 2 means that urls are stable, so I can link to some data and expect it to show the same data later on.
-Sage
On Fri, Sep 11, 2015 at 5:08 PM, Jonathan Morgan jmorgan@wikimedia.org wrote:
I would prefer to have both days-in-past and start/end daterange options, along the lines of Adam's proposal.
But if I have to choose one, I concur with Leila. Start/end daterange offers more functionality.
Jonathan
-- Jonathan T. Morgan Senior Design Researcher Wikimedia Foundation User:Jmorgan (WMF)
I'd also recommend doing both. +1 to the schema proposed by Gabriel.
I'm sure it'll come up eventually but, just a few thoughts:
* Both can be cached (one can be cached for <24h, the other longer).
* The dynamic range allows useful linking to answer canonical questions regarding current trends.
* The dynamic range can either be a redirect resolved by the web app, or it can simply do the response directly (I recommend the latter; but either can cache for <24h). It should probably specify "link rel=canonical" (HTTP header or HTML tag, for machines) with the expanded url, and in case of a UI it can advertise this as the "Permalink" (for humans).
-- Timo
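A minimal sketch of the caching idea in the list above, under stated assumptions: relative ranges are taken to roll over at UTC midnight (the thread leaves the timezone open), the 30-day lifetime for fixed ranges is an arbitrary placeholder, and `cache_headers` is a hypothetical helper rather than anything RESTBase provides.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, Optional

LONG_MAX_AGE = 30 * 24 * 60 * 60  # assumed lifetime for fixed date ranges


def seconds_until_utc_midnight(now: datetime) -> int:
    """Seconds from `now` (timezone-aware, UTC) until the next UTC midnight."""
    tomorrow = (now + timedelta(days=1)).date()
    midnight = datetime(tomorrow.year, tomorrow.month, tomorrow.day, tzinfo=timezone.utc)
    return int((midnight - now).total_seconds())


def cache_headers(is_relative: bool, now: datetime,
                  canonical_url: Optional[str] = None) -> Dict[str, str]:
    """Response headers for a 'top' request.

    Fixed ranges get a long lifetime. Relative ranges expire at the next
    UTC midnight (always under 24h) and point at their expanded URL."""
    if not is_relative:
        return {"Cache-Control": f"public, max-age={LONG_MAX_AGE}"}
    headers = {"Cache-Control": f"public, max-age={seconds_until_utc_midnight(now)}"}
    if canonical_url:
        headers["Link"] = f'<{canonical_url}>; rel="canonical"'
    return headers


now = datetime(2015, 9, 12, 18, 0, tzinfo=timezone.utc)
print(cache_headers(True, now, "/top/en.wikipedia/all-access/2015-08-13/2015-09-12"))
print(cache_headers(False, now))
```

In this sketch the relative response is served directly rather than via a redirect, as recommended above, and the `Link` header carries the expanded URL that a UI could show as the permalink.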
On Sat, Sep 12, 2015 at 1:11 AM, Sage Ross ragesoss+wikipedia@gmail.com wrote:
And, per Tgr on Phabricator, option 2 means that urls are stable, so I can link to some data and expect it to show the same data later on.
-Sage
This seems like a weird way to use restful URLs. Why not parameters?
-Toby
On Fri, Sep 11, 2015 at 4:27 PM, Gabriel Wicke gwicke@wikimedia.org wrote:
On Fri, Sep 11, 2015 at 4:26 PM, Gabriel Wicke gwicke@wikimedia.org wrote:
Another option would be a single entry point
/top/{project}/{access}/from/{start}{/end}
with support for negative indexes for 'days in the past':
/top/{project}/{access}/from/-30
/top/{project}/{access}/from/-60/-30
as well as full dates:
/top/en.wikipedia/all-access/2014-06-12/2015-08-30
Correction:
/top/en.wikipedia/all-access/from/2014-06-12/2015-08-30
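The single entry point proposed above, with negative indexes for "days in the past" alongside full dates, could be resolved on the server roughly as follows. This is a minimal sketch, not RESTBase code: the helper names `resolve_bound` and `resolve_range`, the explicit `today` argument, and the rule that a missing end means "up to today" are all assumptions.

```python
from datetime import date, timedelta
from typing import Optional, Tuple


def resolve_bound(value: str, today: date) -> date:
    """Turn '-30' (days before `today`) or '2014-06-12' into a date."""
    if value.startswith("-"):
        return today - timedelta(days=int(value[1:]))
    return date.fromisoformat(value)


def resolve_range(start: str, end: Optional[str], today: date) -> Tuple[date, date]:
    """Resolve the {start}{/end} part of /top/{project}/{access}/from/...

    Assumption: when {end} is omitted, the range runs up to `today`.
    """
    first = resolve_bound(start, today)
    last = resolve_bound(end, today) if end is not None else today
    if first > last:
        raise ValueError("start must not be after end")
    return first, last
```

With `today` set to 2015-09-11, `resolve_range("-60", "-30", today)` covers 2015-07-13 through 2015-08-12, and the full-date form passes through unchanged.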
On Fri, Sep 11, 2015 at 3:19 PM, paul@paulweiss.info wrote:
I concur with Leila.
Paul
--------- Original Message --------- Subject: Re: [Analytics] [Survey] Pageview API From: "Leila Zia" leila@wikimedia.org Date: 9/11/15 3:06 pm To: "A mailing list for the Analytics Team at WMF and everybody who has an interest in Wikipedia and analytics." analytics@lists.wikimedia.org
It's getting exciting. :-)
I'd go with choice 2 since it gives more control to the user while offering what the user can get through choice 1 as well.
Question: will we get page_ids or page_titles or both? It's good to have both.
Leila
Toby, the main reason for REST paths over query strings is typically caching. With query strings and multiple parameters, the order and presence of parameters are not deterministic. You can use ?from=something&to=somethingElse or ?to=somethingElse&from=something, and the two would be separate cache entries, which is an issue if you plan to cache for longer times and purge actively.
In this particular case it should actually be fine to rely on short-term caching only, which means that query parameters are an option as well.
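To make the caching concern concrete: two query strings that differ only in parameter order name the same resource but look like different cache keys unless something normalizes them first. The sketch below shows one possible normalization using Python's standard library. The function name `canonical_url` is hypothetical, and nothing here reflects how the Wikimedia caching layer actually behaves.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def canonical_url(url: str) -> str:
    """Return `url` with its query parameters sorted by name, then value."""
    parts = urlsplit(url)
    params = sorted(parse_qsl(parts.query, keep_blank_values=True))
    return urlunsplit(parts._replace(query=urlencode(params)))


# Both orderings collapse to the same cache key once normalized.
a = canonical_url("/top/en.wikipedia/all-access?to=2015-08-30&from=2014-06-12")
b = canonical_url("/top/en.wikipedia/all-access?from=2014-06-12&to=2015-08-30")
```

A path-based URL avoids the need for this step, because the position of each segment already fixes the order.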
On Fri, Sep 11, 2015 at 4:46 PM, Toby Negrin tnegrin@wikimedia.org wrote:
This seems like a weird way to use restful URLs. Why not parameters?
-Toby
-- Gabriel Wicke Principal Engineer, Wikimedia Foundation
Hi all,
Two additional questions: (i) Are there plans for making this data available via this API at lower granularity (hourly, or even more fine grained, or even in streaming realtime form)? (ii) Are there plans for adding time zone support?
Thanks, Tom
By time zone support do you mean localising the server-side timestamps to the client location, and making the data available in a form divided-up like that?
On 13 September 2015 at 04:37, Thomas Steiner tomac@google.com wrote:
Hi all,
Two additional questions: (i) Are there plans for making this data available via this API at lower granularity (hourly, or even more fine grained, or even in streaming realtime form)? (ii) Are there plans for adding time zone support?
Thanks, Tom
-- Dr. Thomas Steiner, Employee, Google Inc. http://blog.tomayac.com, http://twitter.com/tomayac
I mean that somehow I could express getting data in an exact given period of time, say, exactly the day September 11, 2015 in the time zone CET (that day started at 3pm relative to PDT or 11pm relative to UTC). Without time zone support, I would get data “outside” of my desired local time zone. Hope this makes sense and is clear.
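A minimal sketch of what such a request amounts to when the stored data is tagged in UTC: translate one local calendar day into the UTC interval it covers. The function name `local_day_in_utc` is hypothetical, and fixed offsets are used instead of named time zones to keep the example self-contained. Note that Central Europe observes summer time (CEST, UTC+2) in September, which gives a 22:00 UTC start; with CET proper (UTC+1) the start would be 23:00 UTC.

```python
from datetime import date, datetime, time, timedelta, timezone
from typing import Tuple


def local_day_in_utc(day: date, utc_offset_hours: float) -> Tuple[datetime, datetime]:
    """Return the [start, end) UTC instants covering `day` at a fixed UTC offset."""
    tz = timezone(timedelta(hours=utc_offset_hours))
    start_local = datetime.combine(day, time.min, tzinfo=tz)
    start_utc = start_local.astimezone(timezone.utc)
    return start_utc, start_utc + timedelta(days=1)


# September 11, 2015 at UTC+2 (CEST) and at UTC+1 (CET).
cest_start, cest_end = local_day_in_utc(date(2015, 9, 11), 2)
cet_start, cet_end = local_day_in_utc(date(2015, 9, 11), 1)
```

Here `cest_start` is 2015-09-10 22:00 UTC, which is 15:00 the same day at UTC-7 (PDT).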
On 13 September 2015 at 16:26, Thomas Steiner tomac@google.com wrote:
I mean that somehow I could express getting data in an exact given period of time, say, exactly the day September 11, 2015 in the time zone CET (that day started at 3pm relative to PDT or 11pm relative to UTC). Without time zone support, I would get data “outside” of my desired local time zone. Hope this makes sense and is clear.
A cautious note on time zones...
If you're holding everything in one hour bins, as we currently do with the aggregated data, then it's easy enough to switch from UTC to CET to EST and so forth.
But not all time zones differ by one hour increments. Most noticeably, India is on UTC+5:30, and a handful of other places also differ by 30 minutes from the standard (or in the case of Nepal, 45). I'm not sure you could display these without regenerating the underlying data, which would be a lot of added complexity.
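The point about hourly bins can be sketched in a few lines: a local day can be rebuilt from hourly UTC counts only when the zone's offset is a whole number of hours. The names `aligns_with_hourly_bins` and `local_day_total`, and the layout of the input list, are assumptions made for illustration.

```python
from datetime import timedelta
from typing import List

HOUR = timedelta(hours=1)


def aligns_with_hourly_bins(utc_offset: timedelta) -> bool:
    """True when local midnight falls on an hourly UTC bin boundary."""
    return utc_offset % HOUR == timedelta(0)


def local_day_total(hourly_utc: List[int], utc_offset: timedelta) -> int:
    """Sum the 24 hourly UTC bins that cover one local day.

    Assumption: hourly_utc[0] is the bin starting at 00:00 UTC on the day
    before the local day of interest, followed by consecutive hourly bins.
    """
    if not aligns_with_hourly_bins(utc_offset):
        raise ValueError("offset is not a whole number of hours; hourly bins cannot be split")
    # Local midnight is `utc_offset` earlier than 00:00 UTC of the same date.
    start = 24 - utc_offset // HOUR
    window = hourly_utc[start:start + 24]
    if len(window) != 24:
        raise ValueError("not enough hourly bins for this offset")
    return sum(window)
```

For UTC+2 the function sums bins 22 through 45 of the input, while an offset such as UTC+5:30 is rejected because the day boundary would fall in the middle of a bin.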
Hi all,
My thoughts and opinion around entry-point definition.
While our long-term plan is to provide on-the-fly, per-query computation, for now we pre-aggregate every dataset we want to serve and store it in Cassandra to be exposed by RESTBase. This means we can't easily provide aggregation over arbitrary start/end ranges.
We could either:
- send every dataset between the start and end dates for a given time granularity level (could be big!), or
- use an entry point such as '/top/{project}/{access}/{year}/{month}/{day}', with the possibility of skipping the 'day' parameter to get the full month.
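A client-side helper for the second option might look like the sketch below. The function name `top_path` and the zero-padding of month and day are assumptions; the thread does not specify how the path segments would be formatted.

```python
from datetime import date
from typing import Optional


def top_path(project: str, access: str, year: int, month: int,
             day: Optional[int] = None) -> str:
    """Build a path for one pre-aggregated day, or a full month when `day` is omitted."""
    # Validate the date parts; day 1 stands in when only the month is requested.
    date(year, month, day if day is not None else 1)
    path = f"/top/{project}/{access}/{year:04d}/{month:02d}"
    if day is not None:
        path += f"/{day:02d}"
    return path
```

Because each path maps to exactly one pre-aggregated dataset, this shape fits the current storage model without any per-query computation.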
*@Thomas*:
- As Andrew said, the data we have is pre-aggregated at the hour level so far.
- The data is tagged in the UTC time zone, and we planned for requests to use that time zone by default.
- As said in this message, we are thinking of ways to provide better access to data (on-the-fly computation, lower time granularity, and others), and this involves both technical and privacy concerns. That will be for the future :)
Joseph
On Sun, Sep 13, 2015 at 5:39 PM, Andrew Gray andrew.gray@dunelm.org.uk wrote:
On 13 September 2015 at 16:26, Thomas Steiner tomac@google.com wrote:
I mean that somehow I could express getting data in an exact given period of time, say, exactly the day September 11, 2015 in the time zone CET (that day started at 3pm relative to PDT or 11pm relative to UTC). Without time zone support, I would get data “outside” of my desired local time zone. Hope this makes sense and is clear.
A cautious note on time zones...
If you're holding everything in one hour bins, as we currently do with the aggregated data, then it's easy enough to switch from UTC to CET to EST and so forth.
But not all time zones differ by one hour increments. Most noticeably, India is on UTC+5:30, and a handful of other places also differ by 30 minutes from the standard (or in the case of Nepal, 45). I'm not sure you could display these without regenerating the underlying data, which would be a lot of added complexity.
--
- Andrew Gray andrew.gray@dunelm.org.uk