On Mon, Nov 2, 2015 at 10:27 AM, Nuria Ruiz <nuria@wikimedia.org> wrote:
> Team:
>
> Please take a look at Mediawiki API data needs, they made a nice wiki page
> for us to understand what type of data do they need.
>
> https://www.mediawiki.org/wiki/User:BDavis_%28WMF%29/Projects/Action_API_request_analytics
>
> We already talked with them about using our user_agent data on wmf table so
> they can start on those reports right away so you might see some oozie CRs
> on that regard. Please have in mind that API folks need raw user agents (as
> every API client should have a unique one) rather than processed ones.

I've updated the wiki page with some refined ideas. It now has a
short section at the end with some very rough numbers taken from the
existing wmf.webrequest data for 2015-11-01. A few things stood out
to me and to the people I've shared these early findings with:
* api.php gets hit 450M+ times a day by 300K+ distinct user-agents
* 65 user-agents each make >1M requests per day
* The top user-agent is no user agent at all (missing/empty header)
* Only 1% of Action API traffic comes from WMF servers (excluding labs)
* The existing UA-based classification of "spider" misses a lot of
  user-agents that are obviously bots
* We have a lot of API consumers that are violating the posted policy
  of using a unique user-agent for requests
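
For anyone who wants to see how numbers like these fall out of the raw
user-agent strings, here is a minimal Python sketch. It is only an
illustration, not the Hive queries behind the figures above: the
function name `summarize_user_agents`, the `heavy_threshold` parameter
and the sample data are all made up, and it treats a missing or empty
header as a single empty-string "user-agent".

```python
from collections import Counter


def summarize_user_agents(user_agents, heavy_threshold=1_000_000):
    """Summarize raw User-Agent values, one per request.

    Hypothetical helper: None and "" both count as the empty user-agent.
    """
    # Normalize missing headers to "" so they are tallied together.
    counts = Counter((ua or "").strip() for ua in user_agents)
    top = counts.most_common(1)
    return {
        "total_requests": sum(counts.values()),
        "distinct_user_agents": len(counts),
        # User-agents making strictly more than heavy_threshold requests.
        "heavy_user_agents": sorted(
            ua for ua, n in counts.items() if n > heavy_threshold
        ),
        # The single most frequent user-agent ("" means no header at all).
        "top_user_agent": top[0][0] if top else None,
    }


if __name__ == "__main__":
    # Made-up sample, with a tiny threshold so the toy data trips it.
    sample = ["", None, "", "botA/1.0", "botA/1.0", "curl/7"]
    print(summarize_user_agents(sample, heavy_threshold=1))
```

The raw strings are used on purpose here, since a processed or
bucketed user-agent would hide exactly the clients that are sharing or
omitting one.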
I have a couple of small patches up for review [0][1] to introduce a
new UDF that can be used to classify an IP address as coming from an
internal, external or labs host. I hope to have some oozie and hive
scripts for review by the end of next week.
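
To give a feel for what that classification does, here is a rough
Python sketch of the idea. It is not the code in the patches: the
function name `classify_ip` is hypothetical, and the network ranges
are documentation placeholders (RFC 5737), not the real production or
labs networks.

```python
import ipaddress

# Placeholder ranges only (RFC 5737 documentation blocks). The real
# production and labs networks are NOT listed here.
INTERNAL_NETS = [ipaddress.ip_network("192.0.2.0/24")]
LABS_NETS = [ipaddress.ip_network("198.51.100.0/24")]


def classify_ip(ip):
    """Return "internal", "labs", "external" or "unknown" for an IP string."""
    try:
        addr = ipaddress.ip_address(ip)
    except ValueError:
        # Not a parseable IPv4/IPv6 address.
        return "unknown"
    # Check labs first so it is reported separately from other
    # internal hosts if the ranges ever overlap.
    if any(addr in net for net in LABS_NETS):
        return "labs"
    if any(addr in net for net in INTERNAL_NETS):
        return "internal"
    return "external"


if __name__ == "__main__":
    for ip in ("192.0.2.10", "198.51.100.7", "203.0.113.5", "not-an-ip"):
        print(ip, classify_ip(ip))
```

With a label like this on each request, the split between WMF
servers, labs and everyone else becomes a simple group-by.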
[0]: https://gerrit.wikimedia.org/r/#/c/253045/
[1]: https://gerrit.wikimedia.org/r/#/c/253046/
Bryan
--
Bryan Davis Wikimedia Foundation <bd808@wikimedia.org>
[[m:User:BDavis_(WMF)]] Sr Software Engineer Boise, ID USA
irc: bd808 v:415.839.6885 x6855