> The existing ua based classification of "spider" misses a lot of user-agents that are obviously bots
If a bot identifies itself as such, we should be marking it as "spider". Please let us know which patterns you think we are missing.
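
To make the request concrete, here is a minimal sketch of the kind of pattern check meant above. The token list and the function name are hypothetical and only for illustration; they are not the pattern our refinery code actually uses.

```python
import re

# Hypothetical token list: words that self-identifying bots often put in
# their User-Agent header. Not the production pattern.
SELF_IDENTIFIED_BOT = re.compile(r"bot|crawler|spider", re.IGNORECASE)


def is_self_identified_bot(user_agent):
    """Return True when the user agent string names itself as a bot."""
    if not user_agent:
        # A missing or empty header does not identify itself as anything.
        return False
    return SELF_IDENTIFIED_BOT.search(user_agent) is not None
```

A plain substring match like this will also produce false positives (any user agent that happens to contain "bot"), so real patterns would need to be tighter.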

Let's also keep in mind that we get a lot of traffic from user agents that look legitimate but that we know (from their request frequency) are bots. So far we
are not tackling that problem.
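
As a rough sketch of what a frequency-based check could look like (we have nothing like this today; the function name, the (ip, user_agent) key and the threshold are all assumptions for illustration):

```python
from collections import Counter


def high_volume_clients(requests, threshold):
    """Return the (ip, user_agent) pairs seen more than `threshold` times.

    `requests` is an iterable of (ip, user_agent) tuples taken from a single
    time window; both the key and the threshold are hypothetical choices.
    """
    counts = Counter(requests)
    return {client for client, n in counts.items() if n > threshold}
```

Where to put the threshold, and how long the window should be, is exactly the part we have not worked out.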

Thanks, 

Nuria




On Fri, Nov 13, 2015 at 3:56 PM, Bryan Davis <bd808@wikimedia.org> wrote:
On Mon, Nov 2, 2015 at 10:27 AM, Nuria Ruiz <nuria@wikimedia.org> wrote:
> Team:
>
> Please take a look at the MediaWiki API data needs; they made a nice wiki page
> for us to understand what type of data they need.
>
> https://www.mediawiki.org/wiki/User:BDavis_%28WMF%29/Projects/Action_API_request_analytics
>
> We already talked with them about using our user_agent data on the wmf table so
> they can start on those reports right away, so you might see some oozie CRs
> in that regard. Please keep in mind that API folks need raw user agents (as
> every API client should have a unique one) rather than processed ones.

I've updated the wiki page with some refined ideas and now have a
short section at the end that gives some really rough numbers that
I've taken from the existing wmf.webrequests data for 2015-11-01.

Some interesting things there at least for me and the people I've
shared these early findings with:

* api.php gets hit 450M+ times a day by 300K+ distinct user-agents
* 65 user-agents each make >1M requests per day
* The top user-agent is no user agent at all (missing/empty header)
* Only 1% of Action API traffic comes from WMF servers (excluding labs)
* The existing ua based classification of "spider" misses a lot of
user-agents that are obviously bots
* We have a lot of API consumers that are violating the posted policy
of using a unique ua for requests

I have a couple of small patches up for review [0][1] to introduce a
new UDF that can be used to classify an IP address as coming from an
internal, external or labs host. I hope to have some oozie and hive
scripts for review by the end of next week.
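
For anyone who has not looked at the patches, the idea behind the classification is roughly the following. This is only a Python sketch of the concept, not the UDF under review; the network ranges are placeholder documentation addresses and the names are hypothetical.

```python
import ipaddress

# Placeholder ranges (RFC 5737 documentation blocks), not real WMF networks.
INTERNAL_NETWORKS = [ipaddress.ip_network("192.0.2.0/24")]
LABS_NETWORKS = [ipaddress.ip_network("198.51.100.0/24")]


def classify_ip(ip):
    """Classify an IP address string as 'labs', 'internal' or 'external'."""
    addr = ipaddress.ip_address(ip)
    # Labs is checked first so that a labs range would win over an
    # overlapping internal range.
    if any(addr in net for net in LABS_NETWORKS):
        return "labs"
    if any(addr in net for net in INTERNAL_NETWORKS):
        return "internal"
    return "external"
```

Anything that falls outside both lists is treated as external.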

[0]: https://gerrit.wikimedia.org/r/#/c/253045/
[1]: https://gerrit.wikimedia.org/r/#/c/253046/

Bryan
--
Bryan Davis              Wikimedia Foundation    <bd808@wikimedia.org>
[[m:User:BDavis_(WMF)]]  Sr Software Engineer            Boise, ID USA
irc: bd808                                        v:415.839.6885 x6855