Team:
Please take a look at the MediaWiki API data needs; they made a nice wiki page to help us understand what type of data they need.
https://www.mediawiki.org/wiki/User:BDavis_%28WMF%29/Projects/Action_API_req...
We already talked with them about using the user_agent data in our wmf table so they can start on those reports right away, so you might see some oozie CRs in that regard. Please keep in mind that the API folks need raw user agents (as every API client should have a unique one) rather than processed ones.
Thanks,
Nuria
On Mon, Nov 2, 2015 at 10:27 AM, Nuria Ruiz nuria@wikimedia.org wrote:
Team:
Please take a look at the MediaWiki API data needs; they made a nice wiki page to help us understand what type of data they need.
https://www.mediawiki.org/wiki/User:BDavis_%28WMF%29/Projects/Action_API_req...
We already talked with them about using the user_agent data in our wmf table so they can start on those reports right away, so you might see some oozie CRs in that regard. Please keep in mind that the API folks need raw user agents (as every API client should have a unique one) rather than processed ones.
I've updated the wiki page with some refined ideas and now have a short section at the end that gives some really rough numbers that I've taken from the existing wmf.webrequests data for 2015-11-01.
Some interesting things there at least for me and the people I've shared these early findings with:
* api.php gets hit 450M+ times a day by 300K+ distinct user-agents
* 65 user-agents each make >1M requests per day
* The top user-agent is no user agent at all (missing/empty header)
* Only 1% of Action API traffic comes from WMF servers (excluding labs)
* The existing ua based classification of "spider" misses a lot of user-agents that are obviously bots
* We have a lot of API consumers that are violating the posted policy of using a unique ua for requests
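The counts in the list above (total hits, distinct user-agents, heavy hitters, and the top user-agent) can be illustrated with a small in-memory sketch. This is not the Hive query used against the warehouse; the function name `summarize_user_agents`, its parameters, and the sample data are all hypothetical.

```python
from collections import Counter


def summarize_user_agents(user_agents, heavy_threshold):
    """Summarize an iterable of raw User-Agent values.

    A value of None or '' stands for a missing/empty header
    (an assumption made for this sketch).
    """
    # Collapse missing headers into one empty-string bucket so they
    # are counted as a single "user-agent".
    counts = Counter((ua or "").strip() for ua in user_agents)
    top_agent, top_hits = counts.most_common(1)[0] if counts else ("", 0)
    return {
        "total_requests": sum(counts.values()),
        "distinct_agents": len(counts),
        # User-agents making strictly more than heavy_threshold requests.
        "heavy_agents": sorted(
            ua for ua, hits in counts.items() if hits > heavy_threshold
        ),
        "top_agent": top_agent,
        "top_hits": top_hits,
    }


# Hypothetical sample: three requests without a user-agent header.
sample = ["", None, "", "ExampleBot/1.0", "ExampleBot/1.0", "Mozilla/5.0"]
summary = summarize_user_agents(sample, heavy_threshold=1)
```

With the real data the threshold would be 1,000,000 requests per day; the sample uses a threshold of 1 only to keep the example small.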
I have a couple of small patches up for review [0][1] to introduce a new UDF that can be used to classify an IP address as coming from an internal, external or labs host. I hope to have some oozie and hive scripts for review by the end of next week.
[0]: https://gerrit.wikimedia.org/r/#/c/253045/
[1]: https://gerrit.wikimedia.org/r/#/c/253046/
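To make the idea of classifying an IP address as internal, external or labs concrete, here is a minimal Python sketch. It is not the UDF from the patches above, which are not reproduced here. The network ranges are placeholder documentation blocks (RFC 5737) rather than real WMF networks, and the function name `classify_ip` and the extra "unknown" result for unparseable input are assumptions made for illustration.

```python
import ipaddress

# Placeholder ranges taken from the RFC 5737 documentation blocks.
# These are NOT real WMF or labs networks; substitute the actual lists.
INTERNAL_NETWORKS = [ipaddress.ip_network("192.0.2.0/24")]
LABS_NETWORKS = [ipaddress.ip_network("198.51.100.0/24")]


def classify_ip(ip):
    """Return 'internal', 'labs', 'external', or 'unknown' for an IP string."""
    try:
        addr = ipaddress.ip_address(ip)
    except ValueError:
        # Not a parseable IPv4/IPv6 address (hypothetical extra category).
        return "unknown"
    # Check labs first so a labs range nested inside an internal range
    # would still be reported as labs.
    if any(addr in net for net in LABS_NETWORKS):
        return "labs"
    if any(addr in net for net in INTERNAL_NETWORKS):
        return "internal"
    return "external"
```

Checking the labs ranges before the internal ones is a design choice for this sketch; whether the real ranges overlap is not stated in the thread.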
Bryan
The existing ua based classification of "spider" misses a lot of user-agents that are obviously bots
If a bot is identifying as such we should be marking it as "spider". Please let us know what patterns you think we are missing.
Let's also keep in mind that we have a lot of traffic from user agents that look legitimate but that we know (due to request frequency) are bots. Thus far we are not tackling that problem.
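The two cases discussed here, bots that identify themselves and bots that are only detectable by request frequency, could be separated roughly as in the sketch below. The regular expression, the function name `flag_likely_bots`, and the threshold parameter are illustrative assumptions; this is not the existing "spider" classification.

```python
import re
from collections import Counter

# Illustrative pattern only; not the pattern used by the existing
# "spider" classification.
SELF_IDENTIFIED_BOT = re.compile(r"bot|crawler|spider", re.IGNORECASE)


def flag_likely_bots(user_agents, max_daily_requests):
    """Map each suspicious user-agent to the reason it was flagged."""
    counts = Counter(ua or "" for ua in user_agents)
    flags = {}
    for ua, hits in counts.items():
        if SELF_IDENTIFIED_BOT.search(ua):
            # The client says it is a bot in its own user-agent string.
            flags[ua] = "self-identified"
        elif hits > max_daily_requests:
            # Looks like a browser, but the volume suggests automation.
            flags[ua] = "high-frequency"
    return flags


# Hypothetical day of traffic.
requests_seen = ["ExampleBot/1.0"] + ["Mozilla/5.0 (X11)"] * 5 + ["curl/7.43"]
flags = flag_likely_bots(requests_seen, max_daily_requests=3)
```

A frequency threshold on the user-agent alone is crude, since many distinct clients share the same browser string; grouping by IP and user-agent together would be one possible refinement.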
Thanks,
Nuria
On Fri, Nov 13, 2015 at 3:56 PM, Bryan Davis bd808@wikimedia.org wrote:
- Only 1% of Action API traffic comes from WMF servers (excluding labs)
This is a lot lower than I'd expect. Is this based on Varnish logs, or specific logs / metrics emitted by the action API code itself? If it is using Varnish logs, then most internal requests hitting LVS directly won't be included.
Gabriel
On Fri, Nov 13, 2015 at 6:40 PM, Gabriel Wicke gwicke@wikimedia.org wrote:
On Fri, Nov 13, 2015 at 3:56 PM, Bryan Davis bd808@wikimedia.org wrote:
- Only 1% of Action API traffic comes from WMF servers (excluding labs)
This is a lot lower than I'd expect. Is this based on Varnish logs, or specific logs / metrics emitted by the action API code itself? If it is using Varnish logs, then most internal requests hitting LVS directly won't be included.
Good point, Gabriel. This is data taken from the current wmf.webrequests table in Hive, so it is based on Varnish traffic. I will be implementing a data feed that is taken directly from the backend MW servers in the coming weeks, and that data may tell a different story for Parsoid requests.
Bryan