There are three different "bot" things you should know about; I'll briefly describe each:
== The "bot" user right:
This is the right that grants the user the ability to perform an edit with a "bot" flag.
Facts:
* Not all users with this right are "bots".
* The flag can be toggled on a per-edit basis. Bot software will activate this flag, but an account can be used by humans and bot software simultaneously. Dedicated bot accounts will typically have all their edits bot-flagged, but other users may contribute regularly and also run a bot from time to time with their credentials[1].
== The "bot" user group:
This user group is available in MediaWiki by default to make it possible to grant a user the bot right. User management goes by groups, not rights: to grant a user the bot right, one adds the user to a group that provides that right.
Facts:
* Group membership can change over time. There are many bot-flagged edits by users that are no longer in a user group providing the bot right. Likewise, there are many edits that are not bot-flagged by users that now have the bot right (which they may or may not use for each edit).
* Not all bots are members of this group (there are other groups that provide this right; sysop, for example).
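To illustrate the difference between "has the bot right" and "is in the bot group", here is a minimal sketch of the group-to-right lookup. The group names and the rights assigned to them are made up for the example (they are not a real wiki's configuration), and the helper names are hypothetical.

```python
# Toy model: rights are granted to groups, and users belong to groups.
# GROUP_RIGHTS is an illustrative assumption, not an actual wiki configuration.
GROUP_RIGHTS = {
    "user": {"edit"},
    "bot": {"edit", "bot"},
    # Per the text above, another group (e.g. sysop) may also provide "bot".
    "sysop": {"edit", "bot", "protect"},
}


def user_rights(groups):
    """Return the union of the rights granted by all of a user's groups."""
    rights = set()
    for group in groups:
        rights |= GROUP_RIGHTS.get(group, set())
    return rights


def can_flag_bot_edit(groups):
    """True if any of the user's groups provides the "bot" right."""
    return "bot" in user_rights(groups)


def in_bot_group(groups):
    """True only if the user is literally a member of the "bot" group."""
    return "bot" in groups
```

With this toy configuration, a user in the groups `["user", "sysop"]` can make bot-flagged edits even though a check for membership in the "bot" group would miss them.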
== The "bot" flag:
This is the only reliable factor. It indicates most accurately that the edit was intended as a bot edit (and that the user could make it because they had the bot user right when the edit was made).
It is especially reliable because the data is stored with the edit, not calculated afterwards (so it is independent of the user's group memberships at the time of the query).
However, it has one catch: the data is only stored in the recentchanges table, from which it expires after 30 days. I guess this explains why the best way is also the least common way to categorize bot edits in analytics (unless only recent data is covered).
There is an open bug to store the bot flag in the revision table, thus making it permanently available[2].
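To make the difference between the two approaches concrete, here is a small sketch using an in-memory SQLite database. The tables contain only the columns mentioned in this thread (the real tables have many more), and the rows and user names are made up for illustration.

```python
import sqlite3


def build_toy_db():
    """Create a tiny stand-in for recentchanges and user_groups (made-up rows)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE recentchanges (
            rc_id INTEGER PRIMARY KEY,
            rc_user INTEGER,
            rc_user_text TEXT,
            rc_bot INTEGER
        );
        CREATE TABLE user_groups (ug_user INTEGER, ug_group TEXT);

        INSERT INTO recentchanges (rc_user, rc_user_text, rc_bot) VALUES
            (1, 'DedicatedBot', 1),  -- rc_id 1: bot account, flagged
            (1, 'DedicatedBot', 1),  -- rc_id 2: bot account, flagged
            (2, 'HumanSysop',   1),  -- rc_id 3: personal account, flagged edit
            (2, 'HumanSysop',   0),  -- rc_id 4: same account, manual edit
            (3, 'UnflaggedBot', 0);  -- rc_id 5: in bot group, edit not flagged

        INSERT INTO user_groups VALUES (1, 'bot'), (2, 'sysop'), (3, 'bot');
    """)
    return conn


def bot_edits_by_flag(conn):
    """Edits that were marked as bot edits when they were made."""
    rows = conn.execute(
        "SELECT rc_id FROM recentchanges WHERE rc_bot = 1 ORDER BY rc_id"
    )
    return [row[0] for row in rows]


def bot_edits_by_group(conn):
    """Edits by users who are in the 'bot' group at query time."""
    rows = conn.execute("""
        SELECT rc.rc_id
        FROM recentchanges rc
        JOIN user_groups ug ON rc.rc_user = ug.ug_user
        WHERE ug.ug_group = 'bot'
        ORDER BY rc.rc_id
    """)
    return [row[0] for row in rows]
```

On this toy data the flag-based query returns edits 1, 2 and 3, while the group-based query returns edits 1, 2 and 5: the group join misses the flagged edit made from a sysop's personal account and includes an edit that was never flagged.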
-- Krinkle
[1] For example, on Commons, where I am a sysop, there is a bot I ran that edits sysop-protected pages, so I had to run this bot under my personal account for a while, marking its edits as "bot". Most sysop-bots (including mine now) have a separate account which is then given membership of the "sysop" user group, but this isn't always the case. For example, on Wikipedia I know there are various admins that use software to automatically block certain IP addresses from time to time (proxies, TOR, zombies, whatever). Some are run on bot accounts, some not.
[2] https://bugzilla.wikimedia.org/show_bug.cgi?id=17237
On May 25, 2012, at 7:49 PM, Fabian Kaelin fkaelin@wikimedia.org wrote:
Hi,
Sorry about the length of this mail, it reads faster than it looks.
I am working with the recentchanges and the cu_changes (checkuser) MediaWiki SQL tables. I would like to be able to filter bot activity, but unfortunately I am increasingly confused.
Things that I think I know:
- In the recentchanges table <http://www.mediawiki.org/wiki/Manual:Recentchanges_table>
there is a `rc_bot` flag that should indicate whether the edit comes from a bot.
- The checkuser table cu_changes <http://www.mediawiki.org/wiki/Extension:CheckUser>
(which is not documented on the MediaWiki database layout page <http://www.mediawiki.org/wiki/Manual:Database_layout>) contains mostly the same information as the recentchanges table but for a longer period of time. However, there is no bot flag as there is on the recentchanges table; I don't know why not.
- There is a `bot` entry in the user_groups.ug_group <http://www.mediawiki.org/wiki/Manual:User_groups_table>
field. A revision/recentchanges/cu_changes entry can be identified as bot by joining the original table with user_groups on the user_id and by setting ug_group=`bot`.
- The user_groups method of identifying bots is inefficient and the
data seems incomplete. For some other projects we have used various other bot tables created by hand (on db1047: halfak.bot used during WSOR 2011 or declerambaul.erik_bots containing the bots identified by Erik Zachte).
I would like to know the answers to the following questions:
- *What is the meaning/purpose of the rc_bot flag on recentchanges?* There
are entries in the recentchanges table from editors that are flagged as bots in the user_groups and the other bot tables but still have the rc_bot flag set to 0.

mysql> select rc.rc_user_text from recentchanges rc join user_groups ug ON (rc.rc_user=ug.ug_user) WHERE ug.ug_group = 'bot' and rc.rc_bot=0 limit 1;
+--------------+
| rc_user_text |
+--------------+
| ClueBot NG   |
+--------------+

- *Why is there no bot flag in the checkuser table?* A lot of the other
fields seem to be copied from the recentchanges table, why not the rc_bot field? The check user table contains both entries that are flagged as bots in the recentchanges table and entries that are flagged as bots in the user_groups.

mysql> select cuc.cuc_user_text from recentchanges rc join cu_changes cuc ON (rc.rc_user=cuc.cuc_user) WHERE rc.rc_bot=1 limit 1;
+---------------+
| cuc_user_text |
+---------------+
| MiszaBot III  |
+---------------+

mysql> select cuc.cuc_user_text from cu_changes cuc join user_groups ug ON (cuc.cuc_user=ug.ug_user) WHERE ug.ug_group = 'bot' limit 1;
+---------------+
| cuc_user_text |
+---------------+
| Robbot        |
+---------------+

- *Am I missing some fundamental information about how bots are handled?* This
is a frequently recurring request for data analytics and it seems the data is inconsistent.
What is the most convenient, sane way to classify bot activity as such? Are there any projects underway that aim to improve the situation? Any input, pointers and recommendations are much appreciated.
Thanks a lot!
Regards,
Fabian