On 19 May 2014 08:17, Brian Keegan <b.keegan(a)neu.edu> wrote:
> Another thought I had was that because many semi-automated tools such as
> Twinkle and AWB leave parenthetical annotations in their revision comments,
> would this be a relatively inexpensive way to filter out revisions rather
> than users? Some caveats, I'd like to get domain experts' feedback on. I'm
> not expecting settled research, just input from others' experiences munging
> the data.
>
> 1. Is the inclusion of this markup in revision comments optional? This is
> a concern that some users may enable or disable it, so I may end up biasing
> inclusion based on users' preferences.
With some tools it is; specifically, I think Twinkle makes the [[WP:TW|TW]]
postfix in edits optional.[0]
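For what it's worth, a minimal sketch of filtering on the comment markers might look like the following. Only the [[WP:TW|TW]] marker comes from this thread; the AWB pattern, the function names and the dict-per-revision layout are assumptions for illustration, so check them against your own data before relying on them. Because the Twinkle suffix is optional, this will under-count tool-assisted edits.

```python
import re

# Marker patterns. Only the Twinkle one is taken from this thread;
# the AWB pattern is an assumption and should be verified against real comments.
TOOL_PATTERNS = {
    "twinkle": re.compile(r"\[\[WP:TW\|TW\]\]"),
    "awb": re.compile(r"\[\[(?:WP|Project|Wikipedia):AWB\|AWB\]\]"),
}


def detect_tool(comment):
    """Return the name of the first tool whose marker appears in the comment, else None."""
    if not comment:
        return None
    for name, pattern in TOOL_PATTERNS.items():
        if pattern.search(comment):
            return name
    return None


def split_revisions(revisions):
    """Split revision dicts (hypothetical layout with a 'comment' key)
    into (manual, tool_assisted) lists."""
    manual, assisted = [], []
    for rev in revisions:
        if detect_tool(rev.get("comment")):
            assisted.append(rev)
        else:
            manual.append(rev)
    return manual, assisted
```

Revisions with an empty or missing comment land in the manual list, which is the conservative choice if the goal is only to drop edits that are positively marked.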
> 2. How have these flags or markup changed over time? This is a concern
> that Twinkle/AWB/etc. may have started/stopped including flags or changed
> what they included over time.
I believe so. It's not so much a problem of tool development (though in some
cases, e.g. Twinkle, it will be, because that tool has been operating for
ages, whereas others are quite new); it's more the overall changes in the
composition of the bot/semi-automated assistance ecosystem. Tools come into
existence, are used, die and are replaced.
> 3. Are there other API queries or data elsewhere I could use to identify
> (semi-)automated revisions?
I'm happy to grab you the full histories of the relevant users/articles in
a TSV if you want; hit me up offlist (the same goes to any other non-WMF
researchers asking for non-PII; if you don't have labs/toolserver access
and need data, ask us!)
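On the API side, one option is to list accounts that currently hold the bot flag via the MediaWiki API's list=allusers module. A rough sketch that only builds the request URL is below; the function name and defaults are hypothetical, and note that this reflects current group membership only, so accounts that lost the flag and semi-automated tool users won't show up.

```python
from urllib.parse import urlencode

# English Wikipedia API endpoint.
API_URL = "https://en.wikipedia.org/w/api.php"


def bot_group_query_url(limit=500, continue_from=None):
    """Build a MediaWiki API URL listing users currently in the 'bot' group.

    The function name and defaults are illustrative, not part of any library.
    """
    params = {
        "action": "query",
        "list": "allusers",
        "augroup": "bot",   # restrict results to the bot user group
        "aulimit": limit,   # number of users per request
        "format": "json",
    }
    if continue_from:
        # Start enumerating from this user name when paging through results.
        params["aufrom"] = continue_from
    return API_URL + "?" + urlencode(params)
```

Fetching the URL and following the continuation values in the response is left out here; any HTTP client will do.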
[0] see https://en.wikipedia.org/wiki/Wikipedia:TWPREFS
--
Oliver Keyes
Research Analyst
Wikimedia Foundation
On 19 May 2014 08:17, Brian Keegan <b.keegan(a)neu.edu> wrote:
> Thanks for all the references and excellent advice so far!
>
> I've looked into the Hale Anti-Bot Method™, but because I've sampled my
> corpus on articles (based on category co-membership), the resulting groupby
> users gives these semi-automated users more "normal" distributions since
> their other contributions are censored. In other words, I see only a
> fraction of these users' contributions and thus the resulting time
> intervals I observe are spaced farther apart (more typical) than they
> actually are. It's not feasible for me to get 100k+ users' histories just
> for the purposes of cleaning up ~6k articles' histories.
>
> Another thought I had was that because many semi-automated tools such as
> Twinkle and AWB leave parenthetical annotations in their revision comments,
> would this be a relatively inexpensive way to filter out revisions rather
> than users? Some caveats, I'd like to get domain experts' feedback on. I'm
> not expecting settled research, just input from others' experiences munging
> the data.
>
> 1. Is the inclusion of this markup in revision comments optional? This is
> a concern that some users may enable or disable it, so I may end up biasing
> inclusion based on users' preferences.
> 2. How have these flags or markup changed over time? This is a concern
> that Twinkle/AWB/etc. may have started/stopped including flags or changed
> what they included over time.
> 3. Are there other API queries or data elsewhere I could use to identify
> (semi-)automated revisions?
>
> On Mon, May 19, 2014 at 10:35 AM, Federico Leva (Nemo) <nemowiki(a)gmail.com> wrote:
>
>> Brian Keegan, 18/05/2014 18:10:
>>
>>> Is there a way to retrieve a canonical list of bots on enwiki or
>>> elsewhere?
>>>
>>
>> A Bots.csv list exists. https://meta.wikimedia.org/wiki/Wikistat_csv
>> In general: please edit
>> https://meta.wikimedia.org/wiki/Research:Identifying_bot_accounts
>>
>> Nemo
>>
>>
>> _______________________________________________
>> Wiki-research-l mailing list
>> Wiki-research-l(a)lists.wikimedia.org
>> https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
>>
>
>
>
> --
> Brian C. Keegan, Ph.D.
> Post-Doctoral Research Fellow, Lazer Lab
> College of Social Sciences and Humanities, Northeastern University
> Fellow, Institute for Quantitative Social Sciences, Harvard University
> Affiliate, Berkman Center for Internet & Society, Harvard Law School
>
> b.keegan(a)neu.edu
> www.brianckeegan.com
> M: 617.803.6971
> O: 617.373.7200
> Skype: bckeegan
>
> _______________________________________________
> Wiki-research-l mailing list
> Wiki-research-l(a)lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
>
>