On 15 September 2015 at 19:37, Dan Andreescu <dandreescu@wikimedia.org> wrote:
I worry a little bit about the performance without having a batch API, but we can certainly try it out and see what happens. Basically we will be requesting the page view information for every NS_MAIN article in every wiki once a week. A quick sum against our search cluster suggests this is ~96 million API requests.

96M requests per week works out to approx 160 req/s, which is more than sustainable for RESTBase.
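
(For reference, the back-of-envelope arithmetic: 96,000,000 requests spread over one week, i.e. 7 × 86,400 seconds, is roughly 159 req/s.)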
 
Oh, sorry, I thought you meant you were just querying 100 or so titles! In the case of huge queries like these, you should just query the wmf.pageview_hourly table directly. You can do so with plain SQL via Hive, or maybe Impala if we end up setting that up. Those queries should be really fast against that table. We can help you write the query if you send us an attempt and a spec of exactly what you need.
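
To illustrate, a rough sketch of what such a weekly query might look like follows; the column names (project, page_title, namespace_id, agent_type, view_count) and the year/month/day partition fields are assumptions on my part and would need to be checked against the actual wmf.pageview_hourly schema:

  -- Sketch only: verify column and partition names against the real schema.
  SELECT project, page_title, SUM(view_count) AS weekly_views
  FROM wmf.pageview_hourly
  WHERE year = 2015 AND month = 9 AND day BETWEEN 7 AND 13   -- one week of partitions
    AND namespace_id = 0         -- NS_MAIN articles only (assumed field)
    AND agent_type = 'user'      -- exclude spider/bot traffic (assumed value)
  GROUP BY project, page_title;

A single Hive job along those lines would return a weekly total per article across all wikis, i.e. the same data the ~96 million individual API calls would fetch.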

My performance-oriented nature would also consider something like that, but I think this is not a decision to be taken lightly. While having an API doesn't come for free, its beauty lies in the abstraction. Concretely, as a pageview client, I am aware of the "contract" between the service and myself, and as such I trust it to fulfil its part of the job. How it does so is completely irrelevant to me, which gives me the opportunity to focus on "my part of the job" (no need for me to worry about the internals of the implementation).

That said, it is clear as day that making 100 individual requests takes more time than making one batch request. However, on the one hand, it sounds like Erik's use case is not latency- (or time-) sensitive. On the other, given the nature of the pageview API, the cost of computing the result (if it is not already available) dwarfs any connection or other related overheads.

Cheers,
Marko


--
Marko Obrovac, PhD
Senior Services Engineer
Wikimedia Foundation