Hadoop was originally built for indexing the web: processing the web map and exporting indexes to serving systems. I think an integration with Elasticsearch would work well.
-Toby
On Wed, Sep 16, 2015 at 7:03 AM, Joseph Allemandou <jallemandou@wikimedia.org> wrote:
@Erik: Reading this thread makes me think that it might be interesting to have a chat about using Hadoop for indexing (https://github.com/elastic/elasticsearch-hadoop). I have no idea how you currently index, but I'd love to learn :) Please let me know if you think it could be useful! Joseph
On Wed, Sep 16, 2015 at 5:15 AM, Erik Bernhardson <ebernhardson@wikimedia.org> wrote:
Makes sense. We will indeed be doing a batch process once a week to build the completion indices, which ideally will run through all the wikis in a day. We are going to do some analysis into how up to date our page view data really needs to be for scoring purposes, though. If we can get good scoring results while only updating page view info when a page is edited, we might be able to spread the load out across time that way and just hit the page view API once for each edit. Otherwise I'm sure we can do as suggested earlier: pull the data from Hive directly and stuff it into a temporary structure we can query while building the completion indices.
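To make the "temporary structure" idea above concrete, here is a minimal sketch of folding a weekly page-view export into an in-memory lookup that a completion-index build could consult. All names here (the row shape, the scoring weight) are illustrative assumptions, not the actual CirrusSearch code:

```python
# Sketch: fold weekly page-view counts (e.g. rows pulled from Hive as
# (page_id, view_count) pairs) into a lookup table, then blend them into
# a page's score while building a completion index.

def build_pageview_lookup(rows):
    """rows: iterable of (page_id, view_count) tuples, e.g. a Hive export."""
    return {page_id: views for page_id, views in rows}

def completion_score(base_score, page_id, pageviews, weight=0.5):
    """Blend a page's base score with its weekly view count.

    The 0.5 weight is a placeholder; real weights would come out of the
    scoring analysis discussed in this thread. Pages with no view data
    fall back to their base score.
    """
    return base_score + weight * pageviews.get(page_id, 0)

# Usage with fake data standing in for a Hive result set:
lookup = build_pageview_lookup([(42, 1000), (7, 30)])
print(completion_score(10.0, 42, lookup))  # 510.0
```

The appeal of this shape is exactly what Erik describes: the expensive data pull happens once per batch run, and the per-page work during the index build is a dictionary lookup rather than an API call.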
On Tue, Sep 15, 2015 at 7:16 PM, Dan Andreescu <dandreescu@wikimedia.org> wrote:
On Tue, Sep 15, 2015 at 6:56 PM, Marko Obrovac <mobrovac@wikimedia.org> wrote:
On 15 September 2015 at 19:37, Dan Andreescu <dandreescu@wikimedia.org> wrote:
I worry a little bit about the performance without having a batch API, but we can certainly try it out and see what happens. Basically, we will be requesting the page view information for every NS_MAIN article in every wiki once a week. A quick sum against our search cluster suggests this is ~96 million API requests.
96M requests over a week is approximately 160 req/s, which is more than sustainable for RESTBase.
True, if we distributed the load over the whole week; but I think Erik needs the results to be available weekly, as in, probably within a day or so of issuing the request. Of course, if we were to serve this kind of request from the API, we would build a better batch-query endpoint for his use case. But I think it might be hard to make that generally useful. For now, let's just collect these one-off pageview-querying use cases and slowly build them into the API when we can generalize two or more of them into one endpoint.
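For reference, the back-of-the-envelope arithmetic behind the two rates being discussed, spread over a week versus compressed into one day:

```python
# 96M weekly requests: sustained rate over a week vs. over a single day.
requests = 96_000_000

per_week_rate = requests / (7 * 24 * 3600)  # ~159 req/s, the figure above
per_day_rate = requests / (24 * 3600)       # ~1111 req/s if done in a day

print(round(per_week_rate), round(per_day_rate))
```

So the ~160 req/s figure only holds if the load is smeared across the full week; finishing within a day, as the batch job would want, pushes the sustained rate up by roughly 7x.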
Analytics mailing list Analytics@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/analytics
-- *Joseph Allemandou* Data Engineer @ Wikimedia Foundation IRC: joal