Hi,
Sorry about the length of this mail, it reads faster than it looks.
I am working with the recentchanges and the cu_changes (checkuser) mediawiki SQL tables. I would like to be able to filter bot activity, unfortunately I am increasingly confused.
Things that I think I know:
- In the recentchangeshttp://www.mediawiki.org/wiki/Manual:Recentchanges_table table there is a `rc_bot` flag that should indicate whether the edit comes from a bot. - The checkuser table cu_changeshttp://www.mediawiki.org/wiki/Extension:CheckUser (which is not documented on the mediawiki database layout pagehttp://www.mediawiki.org/wiki/Manual:Database_layout) contains mostly the same information as the recentchanges table but for a longer period of time. However, there is no bot flag as there is on the recentchanges table - I don't know why not. - There is a `bot` entry in the user_groups.ug_grouphttp://www.mediawiki.org/wiki/Manual:User_groups_table field. A revision/recentchanges/cu_changes entry can be identified as bot by joining the original table with user_groups on the user_id and by setting ug_group=`bot`. - The user_groups method way of identifying bots is inefficient and the data seems incomplete. For some other projects we have used various other bot tables created by hand (on db1047: halfak.bot used during WSOR 2011 or declerambaul.erik_bots containing the bots identified by Erik Zachte).
I would like to know the answers to the following questions:
1. *What is the meaning/purpose of the rc_bot flag on recentchanges? *There are entries in the recentchanges table from editors that are flagged as bots in the user_groups and the other bot tables but still have the rc_bot flag set to 0.
mysql> select rc.rc_user_text from recentchanges rc join user_groups ug ON (rc.rc_user=ug.ug_user) WHERE ug.ug_group = 'bot' and rc.rc_bot=0 limit 1; +--------------+ | rc_user_text | +--------------+ | ClueBot NG | +--------------+
2. *Why is there no bot flag in the checkuser table? *A lot of the other fields seem to be copied from the recentchanges table, why not the rc_bot field? The check user table contains both entries that are flagged as bots in the recentchanges table and entries that are flagged as bots in the user_groups.
mysql> select cuc.cuc_user_text from recentchanges rc join cu_changes cuc ON (rc.rc_user=cuc.cuc_user) WHERE rc.rc_bot=1 limit 1; +---------------+ | cuc_user_text | +---------------+ | MiszaBot III | +---------------+ mysql> select cuc.cuc_user_text from cu_changes cuc join user_groups ug ON (cuc.cuc_user=ug.ug_user) WHERE ug.ug_group = 'bot' limit 1; +---------------+ | cuc_user_text | +---------------+ | Robbot | +---------------+
3. *Am I missing some fundamental information about how bots are handled?* This is a frequently recurring request for data analytics and it seems the data is inconsistent.
What is the most convenient, sane way to classify bot activity as such? Are there any projects underway that aim to improve the situation? Any input, pointers and recommendations are much appreciated.
Thanks a lot! Regards, Fabian