Okay. Methodology:
* Take the last 5 days of request logs;
* filter them down to text/html requests as a heuristic for non-API requests;
* run them through the UA parser we use;
* exclude spiders and anything that reported a valid browser;
* aggregate the user agents left;
* ???
* Profit.
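The steps above can be sketched roughly as below. Everything here is illustrative: the row layout, the browser/spider keyword lists, and the crude substring "UA parser" are assumptions for the sketch, not the actual log schema or the parser actually used.

```python
from collections import Counter

# Assumed row shape: (url, content_type, user_agent). Keyword lists are
# placeholders, not the real UA parser's classification rules.
KNOWN_BROWSERS = ("Mozilla", "Chrome", "Safari", "Opera")
KNOWN_SPIDERS = ("Googlebot", "bingbot", "YandexBot")

def is_candidate_bot(content_type, user_agent):
    """Keep text/html requests whose UA is neither a known spider
    nor something that looks like a real browser."""
    if not content_type.startswith("text/html"):
        return False  # heuristic: drop API/asset traffic
    if any(s in user_agent for s in KNOWN_SPIDERS):
        return False  # search-engine crawlers are out of scope
    if any(b in user_agent for b in KNOWN_BROWSERS):
        return False  # reported a valid browser
    return True

def aggregate(rows):
    """Count requests per user agent among the surviving rows."""
    counts = Counter()
    for url, ctype, ua in rows:
        if is_candidate_bot(ctype, ua):
            counts[ua] += 1
    return counts

rows = [
    ("/wiki/Foo", "text/html", "DotNetWikiBot/3.1"),
    ("/wiki/Bar", "text/html", "Mozilla/5.0 Chrome/34.0"),
    ("/w/api.php", "application/json", "DotNetWikiBot/3.1"),
    ("/wiki/Baz", "text/html", "Googlebot/2.1"),
]
print(aggregate(rows))  # only the text/html DotNetWikiBot hit survives
```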
It looks like there are a relatively small number of bots that browse/interact via the web. Ones I can identify include WPCleaner[0], which is semi-automated; something called "DigitalsmithsBot" that I can't find through WP or Google (could be internal, could be external); and Hoo Bot (run by User:Hoo man). My biggest concern is DotNetWikiBot, which is a general framework that could be masking multiple underlying bots and has ~7.4m requests through the web interface in that time period.
Obvious caveat is obvious: the edits from these tools may actually come through the API, and they're just choosing to request content through the web interface for some weird reason. I don't know enough about the software behind each bot to comment on that. I can try explicitly looking for web-based edit attempts, but there would be far fewer observations in which the bots might appear, because the underlying dataset is sampled at a 1:1000 rate.
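For a sense of scale on that sampling caveat: at a 1:1000 rate each logged row stands in for roughly a thousand requests, so (assuming the ~7.4m figure above is already the scaled-up estimate) it rests on only about 7,400 sampled rows. A one-line back-of-the-envelope:

```python
SAMPLING_RATE = 1000  # 1:1000 sampled request logs

def estimate_total(sampled_rows):
    """Scale a sampled-row count up to an estimated request total."""
    return sampled_rows * SAMPLING_RATE

print(estimate_total(7400))  # ~7.4m requests from ~7,400 sampled rows
```

Rarer events like web-based edit attempts would leave correspondingly fewer rows to work with.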
[0] https://en.wikipedia.org/wiki/User:NicoV/Wikipedia_Cleaner/Documentation
On 20 May 2014 07:50, Oliver Keyes <okeyes(a)wikimedia.org> wrote:
Actually, belay that, I have a pretty good idea. I'll fire the log parser up now.
On 20 May 2014 01:21, Oliver Keyes <okeyes(a)wikimedia.org> wrote:
I think a *lot* of them use the API, but I don't know off the top of my head if it's *all* of them. If only we knew somebody who has spent the last 3 months staring into the Cthulhian nightmare of our request logs and could look this up...
More seriously; drop me a note off-list so that I can try to work out
precisely what you need me to find out, and I'll write a quick-and-dirty
parser of our sampled logs to drag the answer kicking and screaming into
the light.
(sorry, it's annual review season. That always gets me blithe.)
On 19 May 2014 13:03, Scott Hale <computermacgyver(a)gmail.com> wrote:
Thanks all for the comments on my paper, and even more thanks to everyone sharing these super helpful ideas on filtering bots: this is why I love the Wikipedia research committee.
I think Oliver is definitely right that this would be a useful topic for some piece of method-comparing research, if anyone is looking for paper ideas. "Citation goldmine," as one friend called it, I think.
This won't address edit logs to date, but do we know if most bots and automated tools use the API to make edits? If so, would it be feasible to add a flag to each edit indicating whether or not it came through the API? This won't stop determined users, but it might be a nice way to distinguish cyborg edits from those made manually by the same user for many of the standard tools going forward.
The closest thing I found in the bug tracker is [1], but it doesn't address the issue of 'what is a bot', which this thread has clearly shown is quite complex. An API-edit vs. non-API-edit distinction might be a way forward, unless there are automated tools/bots that don't use the API.
[1] https://bugzilla.wikimedia.org/show_bug.cgi?id=11181
Cheers,
Scott
_______________________________________________
Wiki-research-l mailing list
Wiki-research-l(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
--
Oliver Keyes
Research Analyst
Wikimedia Foundation