The existing ua based classification of "spider" misses a lot of user-agents
that are obviously bots.
If a bot identifies itself as such, we should be marking it as "spider". Please let us know what patterns you think we are missing.
Let's also keep in mind that we have a lot of traffic from user agents that look legitimate but that we know (due to request frequency) are bots. Thus far we are not tackling that problem.
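To make the two cases above concrete, here is a minimal Python sketch, not the classification code we actually run. The pattern in `BOT_HINT`, both function names, and the threshold are hypothetical and only illustrate the idea: one check for bots that identify themselves in the UA string, and one for agents that stand out purely by request frequency.

```python
import re
from collections import Counter

# Hypothetical pattern: words that self-identifying bots often put in their UA.
# This is an illustration, not the regex used by the existing classification.
BOT_HINT = re.compile(r"bot|crawler|spider", re.IGNORECASE)


def is_self_identified_bot(user_agent):
    """Return True when the UA string itself advertises automation."""
    return bool(user_agent) and BOT_HINT.search(user_agent) is not None


def frequent_agents(user_agents, threshold):
    """Return the set of UAs seen more than `threshold` times in the sample.

    `threshold` is an assumed cut-off; a real one would have to be chosen
    from the observed traffic distribution.
    """
    counts = Counter(user_agents)
    return {ua for ua, n in counts.items() if n > threshold}
```

The second function is the part we are not tackling today: it flags agents whose UA looks like a browser but whose volume does not.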
Thanks,
Nuria
On Fri, Nov 13, 2015 at 3:56 PM, Bryan Davis bd808@wikimedia.org wrote:
On Mon, Nov 2, 2015 at 10:27 AM, Nuria Ruiz nuria@wikimedia.org wrote:
Team:
Please take a look at the MediaWiki API data needs; they made a nice wiki page for us to understand what type of data they need.
https://www.mediawiki.org/wiki/User:BDavis_%28WMF%29/Projects/Action_API_req...
We already talked with them about using our user_agent data on the wmf table so they can start on those reports right away, so you might see some oozie CRs in that regard. Please keep in mind that API folks need raw user agents (as every API client should have a unique one) rather than processed ones.
I've updated the wiki page with some refined ideas and now have a short section at the end that gives some really rough numbers that I've taken from the existing wmf.webrequests data for 2015-11-01.
Some interesting things there at least for me and the people I've shared these early findings with:
- api.php gets hit 450M+ times a day by 300K+ distinct user-agents
- 65 user-agents each make >1M requests per day
- The top user-agent is no user agent at all (missing/empty header)
- Only 1% of Action API traffic comes from WMF servers (excluding labs)
- The existing ua based classification of "spider" misses a lot of user-agents that are obviously bots
- We have a lot of API consumers that are violating the posted policy of using a unique ua for requests
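For anyone who wants to reproduce this kind of breakdown on a small sample, the counts above can be sketched in Python as follows. This is only an illustration of the aggregation, not the query I ran against the Hive data; `summarize_user_agents` and `heavy_threshold` are hypothetical names, and missing headers are folded into the empty string so they count as one "agent".

```python
from collections import Counter


def summarize_user_agents(user_agents, heavy_threshold):
    """Summarize a sample of raw User-Agent values, one per request.

    Missing headers (None) are treated the same as empty strings.
    """
    counts = Counter(ua or "" for ua in user_agents)
    if not counts:
        return {
            "requests": 0,
            "distinct_user_agents": 0,
            "heavy_user_agents": [],
            "top_user_agent": None,
        }
    top_ua, _ = counts.most_common(1)[0]
    return {
        "requests": sum(counts.values()),
        "distinct_user_agents": len(counts),
        # Agents with more than `heavy_threshold` requests in the sample.
        "heavy_user_agents": sorted(
            ua for ua, n in counts.items() if n > heavy_threshold
        ),
        "top_user_agent": top_ua,
    }
```

With a day of real traffic, `heavy_threshold` would be 1,000,000 to match the second bullet.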
I have a couple of small patches up for review [0][1] to introduce a new UDF that can be used to classify an IP address as coming from an internal, external or labs host. I hope to have some oozie and hive scripts for review by the end of next week.
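The idea behind the classification can be sketched in a few lines of Python. This is not the code in the patches; the function name `classify_ip` is hypothetical, and the networks below are placeholder ranges (RFC 1918 and RFC 5737), not the real production or labs ranges.

```python
import ipaddress

# Placeholder networks for illustration only; NOT real Wikimedia ranges.
INTERNAL_NETS = [ipaddress.ip_network("10.0.0.0/8")]
LABS_NETS = [ipaddress.ip_network("192.0.2.0/24")]


def classify_ip(ip):
    """Classify an IP address string as 'labs', 'internal' or 'external'."""
    addr = ipaddress.ip_address(ip)
    # Labs is checked first so a labs range nested inside an internal
    # range would still be reported as labs.
    if any(addr in net for net in LABS_NETS):
        return "labs"
    if any(addr in net for net in INTERNAL_NETS):
        return "internal"
    return "external"
```

For example, `classify_ip("10.1.2.3")` returns `"internal"` with the placeholder ranges above, and any address outside both lists falls through to `"external"`.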
Bryan
Bryan Davis Wikimedia Foundation bd808@wikimedia.org [[m:User:BDavis_(WMF)]] Sr Software Engineer Boise, ID USA irc: bd808 v:415.839.6885 x6855