There is 3 different "bot" thingies you should know about, I'll briefly
describe
them each:
== The "bot" user right:
This is the right that grants the user the ability to perform an edit with a
"bot" flag.
Facts:
* Not all users with this right are "bots".
* The flag can be toggled on a per-edit basis. Bot software will activate this
flag. But an account can be used by humans and bot software simultaneously.
Dedicated bot accounts will typically have all its edits bot-flagged, but other
users may contribute regularly and also run a bot from time to time with their
credentials[1].
== The "bot" user group:
This user group is available in MediaWiki by default to make it possible to
grant a user the bot right (because user management goes by groups, not rights.
To grant a user the bot right, one adds the user to a group that provides that
right).
Facts:
* Group membership can change over time. There are many bot-flagged edits by
users that are no longer in a user group providing the bot right. Likewise there
are many edits not bot-flagged by users that now have the bot right (which they
may or may not use for each edit).
* Not all bots are a member of this group (there are other groups that provide
this right, sysop, for example).
== The "bot" flag:
This is the only reliable factor. This indicates most accurately that the edit
was intended as bot edit (and that the user could do so because they had the bot
user right when the edit was made).
It is especially reliable because the data is stored with the edit, not
calculated afterwards (so it is regardless of the user's group memberships at
time of query).
However it has one catch: The data is only stored in the recentchanges table,
from which it expires after 30 days. I guess this explains why the best way is
also the least common way to categorize bot edits in analytics (unless only
covering recent data).
There is an open bug to store the bot flag in the revision table, thus making it
permanently available[2].
-- Krinkle
[1] For example on Commons, where I am a sysop, there is a bot I ran that edits
sysop-protected pages, therefore I had to run this bot under my personal account
for a while, marking its edits as "bot". Most sysop-bots (including mine now)
have
a separate account which is then given membership to the "sysop" user group,
but this isn't always the case. For example on Wikipedia I know there's various
admins that use software to automatically block certain IP-addresses from time
to time (proxies, TOR, zombies, whatever). Some are ran on bot accounts, some
not.
[2]
https://bugzilla.wikimedia.org/show_bug.cgi?id=17237
On May 25, 2012, at 7:49 PM, Fabian Kaelin <fkaelin(a)wikimedia.org> wrote:
Hi,
Sorry about the length of this mail, it reads faster than it looks.
I am working with the recentchanges and the cu_changes (checkuser)
mediawiki SQL tables. I would like to be able to filter bot activity,
unfortunately I am increasingly confused.
Things that I think I know:
- In the
recentchanges<http://www.mediawiki.org/wiki/Manual:Recentchanges_table&g…
table
there is a `rc_bot` flag that should indicate whether the edit comes from a
bot.
- The checkuser table
cu_changes<http://www.mediawiki.org/wiki/Extension:CheckUser> (which
is not documented on the mediawiki database layout
page<http://www.mediawiki.org/wiki/Manual:Database_layout>)
contains mostly the same information as the recentchanges table but for a
longer period of time. However, there is no bot flag as there is on the
recentchanges table - I don't know why not.
- There is a `bot` entry in the
user_groups.ug_group<http://www.mediawiki.org/wiki/Manual:User_groups_ta…
field.
A revision/recentchanges/cu_changes entry can be identified as bot by
joining the original table with user_groups on the user_id and by setting
ug_group=`bot`.
- The user_groups method way of identifying bots is inefficient and the
data seems incomplete. For some other projects we have used various other
bot tables created by hand (on db1047: halfak.bot used during WSOR 2011 or
declerambaul.erik_bots containing the bots identified by Erik Zachte).
I would like to know the answers to the following questions:
1. *What is the meaning/purpose of the rc_bot flag on recentchanges? *There
are entries in the recentchanges table from editors that are flagged as
bots in the user_groups and the other bot tables but still have the rc_bot
flag set to 0.
mysql> select rc.rc_user_text from recentchanges rc join user_groups ug ON
(rc.rc_user=ug.ug_user) WHERE ug.ug_group = 'bot' and rc.rc_bot=0 limit 1;
+--------------+
| rc_user_text |
+--------------+
| ClueBot NG |
+--------------+
2. *Why is there no bot flag in the checkuser table? *A lot of the other
fields seem to be copied from the recentchanges table, why not the rc_bot
field? The check user table contains both entries that are flagged as bots
in the recentchanges table and entries that are flagged as bots in the
user_groups.
mysql> select cuc.cuc_user_text from recentchanges rc join cu_changes cuc
ON (rc.rc_user=cuc.cuc_user) WHERE rc.rc_bot=1 limit 1;
+---------------+
| cuc_user_text |
+---------------+
| MiszaBot III |
+---------------+
mysql> select cuc.cuc_user_text from cu_changes cuc join user_groups ug ON
(cuc.cuc_user=ug.ug_user) WHERE ug.ug_group = 'bot' limit 1;
+---------------+
| cuc_user_text |
+---------------+
| Robbot |
+---------------+
3. *Am I missing some fundamental information about how bots are handled?* This
is a frequently recurring request for data analytics and it seems the data
is inconsistent.
What is the most convenient, sane way to classify bot activity as such? Are
there any projects underway that aim to improve the situation? Any input,
pointers and recommendations are much appreciated.
Thanks a lot! Regards,
Fabian
_______________________________________________
Wikitech-l mailing list
Wikitech-l(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l