On 19 May 2014 08:17, Brian Keegan <b.keegan@neu.edu> wrote:

Another thought: because many semi-automated tools such as Twinkle and AWB leave parenthetical annotations in their revision comments, would matching on those be a relatively inexpensive way to filter out revisions rather than users? There are some caveats I'd like domain experts' feedback on. I'm not expecting settled research, just input from others' experiences munging the data.

1. Is the inclusion of this markup in revision comments optional? My concern is that some users may enable or disable it, so I may end up biasing inclusion based on users' preferences.
 
With some tools it is; specifically, I think Twinkle makes the [[WP:TW|TW]] suffix in edit summaries optional.[0]
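
For what it's worth, a comment-based filter along these lines could be sketched as below. This is only a rough illustration: the Twinkle pattern is the [[WP:TW|TW]] suffix mentioned above, while the AWB pattern and the function names are assumptions that would need checking against real revision comments. Since the suffix can be switched off, anything left untagged is not guaranteed to be a manual edit.

```python
import re

# Assumed tag patterns. Only the Twinkle suffix appears in this thread;
# the AWB pattern is a guess at its usual summary text and should be
# verified against real data before use.
TOOL_PATTERNS = {
    "twinkle": re.compile(r"\(\[\[WP:TW\|TW\]\]\)"),
    "awb": re.compile(r"\[\[(?:Project|Wikipedia|WP):(?:AWB|AutoWikiBrowser)\|AWB\]\]"),
}


def tool_for_comment(comment):
    """Return the name of the first tool whose tag appears in the comment, else None."""
    if not comment:
        return None
    for name, pattern in TOOL_PATTERNS.items():
        if pattern.search(comment):
            return name
    return None


def split_revisions(revisions):
    """Partition (rev_id, comment) pairs into tagged and untagged revision ids."""
    tagged, untagged = [], []
    for rev_id, comment in revisions:
        (tagged if tool_for_comment(comment) else untagged).append(rev_id)
    return tagged, untagged
```

Keeping the patterns in a single dictionary makes it easier to add tools, or older variants of the same tool's tag, as they turn up in the data.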

 
2. How have these flags or markup changed over time? My concern is that Twinkle/AWB/etc. may have started or stopped including flags, or changed what they included, over time.

I believe they have. This is less a matter of individual tools changing (though for long-running tools, e.g. Twinkle, it will be; others are quite new) than of the overall bot/semi-automated tool ecosystem shifting. Tools come into existence, get used, die and are replaced.
 
3. Are there other API queries or data elsewhere I could use to identify (semi-)automated revisions?


I'm happy to grab you the full histories of the relevant users/articles in a TSV if you want; hit me up offlist. (The same goes for any other non-WMF researchers asking for non-PII data; if you don't have labs/toolserver access and need data, ask us!)


[0] see https://en.wikipedia.org/wiki/Wikipedia:TWPREFS
--
Oliver Keyes
Research Analyst
Wikimedia Foundation


On 19 May 2014 08:17, Brian Keegan <b.keegan@neu.edu> wrote:
Thanks for all the references and excellent advice so far!

I've looked into the Hale Anti-Bot Method™, but because I've sampled my corpus on articles (based on category co-membership), the resulting groupby users gives these semi-automated users more "normal" distributions since their other contributions are censored. In other words, I see only a fraction of these users' contributions and thus the resulting time intervals I observe are spaced farther apart (more typical) than they actually are. It's not feasible for me to get 100k+ users' histories just for the purposes of cleaning up ~6k articles' histories.
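
To make the censoring effect concrete, here is a toy sketch with made-up numbers (the 10-second edit rate and the 1-in-20 sampling fraction are purely hypothetical, as is the function name). It compares the median inter-edit interval computed from a user's full history with the one computed from only the edits that land in an article sample.

```python
from statistics import median


def median_gap(timestamps):
    """Median interval in seconds between consecutive edits; None if fewer than two."""
    ordered = sorted(timestamps)
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    return median(gaps) if gaps else None


# Hypothetical tool-assisted user: one edit every 10 seconds, spread over many articles.
full_history = list(range(0, 1000, 10))

# Suppose only every 20th edit touches an article in the sampled corpus.
sampled = full_history[::20]

print(median_gap(full_history))  # interval seen with the full history
print(median_gap(sampled))       # interval seen through the article sample
```

With these numbers the sampled view reports intervals twenty times longer than the true ones, which is why an interval-based cutoff tuned on full histories would tend to miss such users in an article-based sample.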

Another thought: because many semi-automated tools such as Twinkle and AWB leave parenthetical annotations in their revision comments, would matching on those be a relatively inexpensive way to filter out revisions rather than users? There are some caveats I'd like domain experts' feedback on. I'm not expecting settled research, just input from others' experiences munging the data.

1. Is the inclusion of this markup in revision comments optional? My concern is that some users may enable or disable it, so I may end up biasing inclusion based on users' preferences.
2. How have these flags or markup changed over time? My concern is that Twinkle/AWB/etc. may have started or stopped including flags, or changed what they included, over time.
3. Are there other API queries or data elsewhere I could use to identify (semi-)automated revisions?


On Mon, May 19, 2014 at 10:35 AM, Federico Leva (Nemo) <nemowiki@gmail.com> wrote:
Brian Keegan, 18/05/2014 18:10:

Is there a way to retrieve a canonical list of bots on enwiki or elsewhere?

A Bots.csv list exists. https://meta.wikimedia.org/wiki/Wikistat_csv
In general: please edit https://meta.wikimedia.org/wiki/Research:Identifying_bot_accounts
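
As a complement to the CSV, the MediaWiki API can enumerate accounts in the bot group via list=allusers with augroup=bot. The sketch below only builds the request URL; the function name and defaults are illustrative assumptions. Note that this lists accounts that currently hold the flag, so it may miss former or unflagged bots.

```python
from urllib.parse import urlencode

# English Wikipedia API endpoint; swap the host for other wikis.
API = "https://en.wikipedia.org/w/api.php"


def bot_list_url(start_from=None, limit=500):
    """Build an allusers query URL restricted to the 'bot' user group."""
    params = {
        "action": "query",
        "list": "allusers",
        "augroup": "bot",
        "aulimit": limit,
        "format": "json",
    }
    if start_from:
        # aufrom sets the user name at which enumeration starts (for paging).
        params["aufrom"] = start_from
    return API + "?" + urlencode(params)


print(bot_list_url())
```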

Nemo


_______________________________________________
Wiki-research-l mailing list
Wiki-research-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l



--
Brian C. Keegan, Ph.D.
Post-Doctoral Research Fellow, Lazer Lab
College of Social Sciences and Humanities, Northeastern University
Fellow, Institute for Quantitative Social Sciences, Harvard University
Affiliate, Berkman Center for Internet & Society, Harvard Law School

Skype: bckeegan
