that are obviously bots
If a bot identifies itself as such, we should be marking it as "spider".
Please let us know which patterns you think we are missing.
Let's also keep in mind that we get a lot of traffic from user agents that
look legitimate but that we know (from their request frequency) are bots.
Thus far we are not tackling that problem.
Thanks,
Nuria
On Fri, Nov 13, 2015 at 3:56 PM, Bryan Davis <bd808(a)wikimedia.org> wrote:
On Mon, Nov 2, 2015 at 10:27 AM, Nuria Ruiz <nuria(a)wikimedia.org> wrote:
Team:
Please take a look at the MediaWiki API data needs; they made a nice wiki
page to help us understand what type of data they need.
https://www.mediawiki.org/wiki/User:BDavis_%28WMF%29/Projects/Action_API_re…
We already talked with them about using our user_agent data on the wmf
table so they can start on those reports right away, so you might see some
oozie CRs in that regard. Please keep in mind that API folks need raw user
agents (as every API client should have a unique one) rather than
processed ones.
I've updated the wiki page with some refined ideas and now have a
short section at the end that gives some really rough numbers that
I've taken from the existing wmf.webrequests data for 2015-11-01.
Some interesting things there at least for me and the people I've
shared these early findings with:
* api.php gets hit 450M+ times a day by 300K+ distinct user-agents
* 65 user-agents each make >1M requests per day
* The top user-agent is no user agent at all (missing/empty header)
* Only 1% of Action API traffic comes from WMF servers (excluding labs)
* The existing ua based classification of "spider" misses a lot of
user-agents that are obviously bots
* We have a lot of API consumers that are violating the posted policy
of using a unique ua for requests
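For illustration, per-user-agent figures like the ones above could be
derived from request logs roughly as follows. This is a minimal Python
sketch, not the Hive queries actually used: the function name
`summarize_user_agents`, the `(uri_path, user_agent)` record shape and the
sample rows are all assumptions made for the example.

```python
from collections import Counter


def summarize_user_agents(requests, heavy_threshold=1_000_000):
    """Summarize api.php traffic by raw (unprocessed) user agent.

    `requests` is an iterable of (uri_path, user_agent) pairs. This record
    shape is hypothetical; it is not the schema of the real request table.
    """
    counts = Counter()
    for uri_path, user_agent in requests:
        if not uri_path.endswith("/api.php"):
            continue  # only Action API hits are of interest here
        # A missing or blank header is counted as one bucket, keyed by "".
        counts[(user_agent or "").strip()] += 1

    # User agents whose request count exceeds the threshold.
    heavy = sorted(ua for ua, n in counts.items() if n > heavy_threshold)
    top = counts.most_common(1)[0][0] if counts else None
    return {
        "total_requests": sum(counts.values()),
        "distinct_user_agents": len(counts),
        "heavy_user_agents": heavy,
        "top_user_agent": top,
    }


# Made-up sample rows, with a tiny threshold so the example is easy to follow.
sample = [
    ("/w/api.php", ""),
    ("/w/api.php", None),
    ("/w/api.php", "ExampleBot/1.0"),
    ("/w/index.php", "Mozilla/5.0"),
]
summary = summarize_user_agents(sample, heavy_threshold=1)
```

Grouping on the raw header rather than a parsed one matters here, since
the policy asks each API client to send its own unique user agent.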
I have a couple of small patches up for review [0][1] to introduce a
new UDF that can be used to classify an IP address as coming from an
internal, external or labs host. I hope to have some oozie and hive
scripts for review by the end of next week.
[0]: https://gerrit.wikimedia.org/r/#/c/253045/
[1]: https://gerrit.wikimedia.org/r/#/c/253046/
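To make the idea of the classification concrete, here is a rough Python
sketch of mapping an IP address to internal, external or labs. It is not
the UDF from the patches above. The network ranges are placeholder
documentation ranges (RFC 5737), not real WMF networks, and the name
`classify_ip` and the extra "unknown" result for unparseable input are
assumptions of this sketch.

```python
import ipaddress

# Placeholder ranges taken from RFC 5737 documentation space.
# They are NOT real WMF production or labs networks.
INTERNAL_NETWORKS = [ipaddress.ip_network("192.0.2.0/24")]
LABS_NETWORKS = [ipaddress.ip_network("198.51.100.0/24")]


def classify_ip(ip):
    """Return "internal", "labs", "external", or "unknown" for bad input."""
    try:
        addr = ipaddress.ip_address(ip)
    except ValueError:
        # Hypothetical fourth category for values that are not IP addresses.
        return "unknown"
    # Membership tests return False when the IP versions differ,
    # so IPv6 addresses simply fall through to "external" here.
    if any(addr in net for net in LABS_NETWORKS):
        return "labs"
    if any(addr in net for net in INTERNAL_NETWORKS):
        return "internal"
    return "external"
```

For example, `classify_ip("198.51.100.7")` falls in the placeholder labs
range, while any address outside both lists is treated as external.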
Bryan
--
Bryan Davis Wikimedia Foundation <bd808(a)wikimedia.org>
[[m:User:BDavis_(WMF)]] Sr Software Engineer Boise, ID USA
irc: bd808 v:415.839.6885 x6855