On Mon, Nov 2, 2015 at 10:27 AM, Nuria Ruiz nuria@wikimedia.org wrote:
Team:
Please take a look at the MediaWiki API team's data needs. They made a nice wiki page to help us understand what type of data they need.
https://www.mediawiki.org/wiki/User:BDavis_%28WMF%29/Projects/Action_API_req...
We already talked with them about using our user_agent data in the wmf table so they can start on those reports right away, so you might see some oozie CRs in that regard. Please keep in mind that the API folks need raw user agents (as every API client should have a unique one) rather than processed ones.
I've updated the wiki page with some refined ideas. It now has a short section at the end that gives some really rough numbers that I've taken from the existing wmf.webrequests data for 2015-11-01.
Some interesting things there, at least for me and the people I've shared these early findings with:
* api.php gets hit 450M+ times a day by 300K+ distinct user-agents
* 65 user-agents each make >1M requests per day
* The top user-agent is no user agent at all (missing/empty header)
* Only 1% of Action API traffic comes from WMF servers (excluding labs)
* The existing ua based classification of "spider" misses a lot of user-agents that are obviously bots
* We have a lot of API consumers that are violating the posted policy of using a unique ua for requests
I have a couple of small patches up for review [0][1] to introduce a new UDF that can be used to classify an IP address as coming from an internal, external or labs host. I hope to have some oozie and hive scripts for review by the end of next week.
[0]: https://gerrit.wikimedia.org/r/#/c/253045/
[1]: https://gerrit.wikimedia.org/r/#/c/253046/
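
To give a rough idea of what the internal/external/labs classification could look like, here is a minimal Python sketch. It is not the code in the patches above. The function name classify_ip, the two network lists, and the "unknown" fallback for unparseable input are all assumptions made for illustration, and the CIDR ranges are reserved documentation blocks standing in for the real networks.

```python
import ipaddress

# Placeholder ranges (reserved documentation blocks), NOT real WMF networks.
# A real classifier would load the actual labs and production ranges.
LABS_NETWORKS = [ipaddress.ip_network("198.51.100.0/24")]
INTERNAL_NETWORKS = [ipaddress.ip_network("192.0.2.0/24")]


def classify_ip(address: str) -> str:
    """Classify an IP address string as 'labs', 'internal' or 'external'.

    Returns 'unknown' when the string is not a valid IP address; that
    fourth value is an assumption of this sketch.
    """
    try:
        ip = ipaddress.ip_address(address)
    except ValueError:
        return "unknown"

    # Check labs first so labs hosts are never counted as internal,
    # even if the real ranges turn out to be nested.
    if any(ip in net for net in LABS_NETWORKS):
        return "labs"
    if any(ip in net for net in INTERNAL_NETWORKS):
        return "internal"
    return "external"


if __name__ == "__main__":
    for sample in ("198.51.100.7", "192.0.2.10", "203.0.113.5", "not-an-ip"):
        print(sample, classify_ip(sample))
```

Checking labs before internal matters if the labs range is a subset of a wider internal range; with the placeholder ranges here the order makes no difference.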
Bryan