On 22 September 2015 at 14:40, Oliver Keyes okeyes@wikimedia.org wrote:
On 22 September 2015 at 05:10, Marko Obrovac mobrovac@wikimedia.org wrote:
Hello,
Just a small note which I don't think has been voiced thus far. There will actually be two APIs: one exposed by the Analytics RESTBase instance, which will be accessible only from inside WMF's infrastructure, and another, public-facing one exposed by the Services RESTBase instance.
Now, these may be identical (both in layout and functionality) or may differ slightly. Which way to go? The big pro of keeping them identical is that clients wouldn't need to care which RESTBase instance they are actually contacting, and it would also ease API maintenance. On the down side, it increases the overhead for Analytics to keep their domain list in sync.
Having a more specialised API for the Analytics instance, on the other hand, would allow us to tailor it to real internal use cases instead of focusing on overall API coherence (which we do need to do for the public-facing API). I'd honestly vote for that option.
Can you give an example of internal-facing use cases you don't see a broader population of consumers being interested in?
In my mail I was mostly hinting at the fact that the public-facing API is divided by domains, whilst the notion of projects is better suited to Analytics. So the internal API could be organised around projects while still supporting domains, but in a looser format than the public one.
We plan to support arbitrary projects (such as en-all, all-wiktionary, etc.) on the public side as well, but because of the current layout, the Analytics (public) API will be fragmented. There is no need to impose the same fragmentation on the internal API.
To concretely answer the question: I am not aware of any specific use case. I am just pointing out that internal users can, if they need or want to, rely on projects rather than on domains.
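Purely for illustration (the hosts, paths and project names below are hypothetical, not the actual routes), the difference for a client would look roughly like this:

import requests

# Hypothetical sketch of the two layouts. Hostnames, paths and the
# "all-wikipedia" project name are assumptions, not the real API.
DOMAINS = ["en.wikipedia.org", "de.wikipedia.org", "fr.wikipedia.org"]  # ...and many more

def public_pageviews(article, start, end):
    """Domain-oriented public layout: fan out one request per domain."""
    results = {}
    for domain in DOMAINS:
        url = (f"https://public-api.example.org/v1/{domain}/pageviews/"
               f"per-article/{article}/daily/{start}/{end}")
        results[domain] = requests.get(url).json()
    return results

def internal_pageviews(article, start, end, project="all-wikipedia"):
    """Project-oriented internal layout: a single project-level request."""
    url = (f"http://analytics-restbase.svc.internal/v1/{project}/pageviews/"
           f"per-article/{article}/daily/{start}/{end}")
    return requests.get(url).json()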
Cheers, Marko
On 16 September 2015 at 16:06, Toby Negrin tnegrin@wikimedia.org wrote:
Hadoop was originally built for indexing the web by processing the web map and exporting indexes to serving systems. I think an integration with Elasticsearch would work well.
Right, both are indexing systems (so to speak), but the former is for offline use, while the latter targets online use. Ideally, we should make them cooperate to get the best out of both worlds.
Cheers, Marko
-Toby
On Wed, Sep 16, 2015 at 7:03 AM, Joseph Allemandou jallemandou@wikimedia.org wrote:
@Erik: Reading this thread makes me think that it might be interesting to have a chat around using Hadoop for indexing (https://github.com/elastic/elasticsearch-hadoop). I have no idea how you currently index, but I'd love to learn :) Please let me know if you think it could be useful!
Joseph
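PS: a very rough, untested sketch of what the Spark side of elasticsearch-hadoop could look like for weekly pageview counts. The Hive table and column names, the index name and the host are assumptions, and the connector jar would need to be on the Spark classpath:

from pyspark.sql import SparkSession

# Assumes a Spark session with Hive support and the elasticsearch-hadoop
# connector available; all names below are illustrative.
spark = (SparkSession.builder
         .appName("pageviews-to-es")
         .enableHiveSupport()
         .getOrCreate())

# Aggregate one week of per-article pageview counts (hypothetical schema).
weekly_counts = spark.sql("""
    SELECT project, page_title, SUM(view_count) AS weekly_views
    FROM wmf.pageview_hourly
    WHERE year = 2015 AND month = 9 AND day BETWEEN 14 AND 20
    GROUP BY project, page_title
""")

# Push the result into an Elasticsearch index via the es-hadoop data source.
(weekly_counts.write
    .format("org.elasticsearch.spark.sql")
    .option("es.nodes", "elastic.svc.internal")       # hypothetical host
    .option("es.resource", "pageviews_weekly/doc")    # hypothetical index/type
    .save())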
On Wed, Sep 16, 2015 at 5:15 AM, Erik Bernhardson ebernhardson@wikimedia.org wrote:
Makes sense. We will indeed be doing a batch process once a week to build the completion indices, which ideally will run through all the wikis in a day. We are going to do some analysis into how up-to-date our page view data really needs to be for scoring purposes, though. If we can get good scoring results while only updating page view info when a page is edited, we might be able to spread the load out across time that way and just hit the page view API once for each edit. Otherwise I'm sure we can do as suggested earlier and pull the data from Hive directly and stuff it into a temporary structure we can query while building the completion indices.
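For the record, the "temporary structure" could be as simple as something like the following (untested sketch; the TSV export and the column names are made up):

import csv
import sqlite3

def load_pageviews(tsv_path, db_path="pageviews_week.sqlite"):
    """Load a Hive export (project, page_title, weekly_views as TSV) into SQLite."""
    db = sqlite3.connect(db_path)
    db.execute("""CREATE TABLE IF NOT EXISTS pageviews (
                      project TEXT,
                      page_title TEXT,
                      weekly_views INTEGER,
                      PRIMARY KEY (project, page_title))""")
    with open(tsv_path, newline="") as f:
        rows = ((p, t, int(v)) for p, t, v in csv.reader(f, delimiter="\t"))
        db.executemany("INSERT OR REPLACE INTO pageviews VALUES (?, ?, ?)", rows)
    db.commit()
    return db

def views_for(db, project, title):
    """Look up the weekly view count while building the completion indices."""
    row = db.execute(
        "SELECT weekly_views FROM pageviews WHERE project = ? AND page_title = ?",
        (project, title)).fetchone()
    return row[0] if row else 0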
On Tue, Sep 15, 2015 at 7:16 PM, Dan Andreescu dandreescu@wikimedia.org wrote:
On Tue, Sep 15, 2015 at 6:56 PM, Marko Obrovac <mobrovac@wikimedia.org> wrote:
> On 15 September 2015 at 19:37, Dan Andreescu <dandreescu@wikimedia.org> wrote:
>>> I worry a little bit about the performance without having a batch api, but we can certainly try it out and see what happens. Basically we will be requesting the page view information for every NS_MAIN article in every wiki once a week. A quick sum against our search cluster suggests this is ~96 million api requests.
>
> 96m equals approx 160 req/s which is more than sustainable for RESTBase.
True, if we distributed the load over the whole week, but I think Erik needs the results to be available weekly, as in, probably within a day or so of issuing the request. Of course, if we were to serve this kind of request from the API, we would make a better batch-query endpoint for his use case. But I think it might be hard to make that useful generally. I think for now, let's just collect these one-off pageview querying use cases and slowly build them into the API when we can generalize two or more of them into one endpoint.
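To put rough numbers and shapes on it (the batch endpoint below is purely hypothetical; only the per-article pattern reflects the current design, and the URLs are illustrative):

# Back-of-the-envelope rates for the ~96 million requests mentioned above.
requests_total = 96_000_000
print(requests_total / (7 * 24 * 3600))  # ~159 req/s if spread over a full week
print(requests_total / (24 * 3600))      # ~1111 req/s if squeezed into one day

import requests

def per_article(titles):
    """One GET per title: the access pattern available today (illustrative URL)."""
    for title in titles:
        requests.get(f"https://wikimedia.example.org/api/v1/pageviews/per-article/{title}")

def batched(titles, batch_size=500):
    """A hypothetical batch endpoint accepting many titles per POST."""
    for i in range(0, len(titles), batch_size):
        requests.post("https://wikimedia.example.org/api/v1/pageviews/batch",
                      json={"titles": titles[i:i + batch_size]})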
--
Joseph Allemandou
Data Engineer @ Wikimedia Foundation
IRC: joal
--
Marko Obrovac, PhD
Senior Services Engineer
Wikimedia Foundation
--
Oliver Keyes
Count Logula
Wikimedia Foundation
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics