The bot flag is rather confusing since edits can be marked as bot edits or non-bot edits regardless of whether they are actually coming from a bot. The edit API accepts a 'bot' or 'notbot' parameter and I believe this is what actually gets recorded in the recentchanges table. Obviously this isn't very helpful for getting reliable analytics on bot edits :(
Ryan Kaldari
On 5/25/12 10:49 AM, Fabian Kaelin wrote:
Hi,
Sorry about the length of this mail, it reads faster than it looks.
I am working with the recentchanges and the cu_changes (checkuser) mediawiki SQL tables. I would like to be able to filter bot activity, unfortunately I am increasingly confused.
Things that I think I know:
- In the recentchanges<http://www.mediawiki.org/wiki/Manual:Recentchanges_table>
table there is a `rc_bot` flag that should indicate whether the edit comes from a bot. - The checkuser table cu_changeshttp://www.mediawiki.org/wiki/Extension:CheckUser (which is not documented on the mediawiki database layout pagehttp://www.mediawiki.org/wiki/Manual:Database_layout) contains mostly the same information as the recentchanges table but for a longer period of time. However, there is no bot flag as there is on the recentchanges table - I don't know why not. - There is a `bot` entry in the user_groups.ug_grouphttp://www.mediawiki.org/wiki/Manual:User_groups_table field. A revision/recentchanges/cu_changes entry can be identified as bot by joining the original table with user_groups on the user_id and by setting ug_group=`bot`. - The user_groups method way of identifying bots is inefficient and the data seems incomplete. For some other projects we have used various other bot tables created by hand (on db1047: halfak.bot used during WSOR 2011 or declerambaul.erik_bots containing the bots identified by Erik Zachte).
I would like to know the answers to the following questions:
- *What is the meaning/purpose of the rc_bot flag on recentchanges? *There
are entries in the recentchanges table from editors that are flagged as bots in the user_groups and the other bot tables but still have the rc_bot flag set to 0.
mysql> select rc.rc_user_text from recentchanges rc join user_groups ug ON (rc.rc_user=ug.ug_user) WHERE ug.ug_group = 'bot' and rc.rc_bot=0 limit 1; +--------------+ | rc_user_text | +--------------+ | ClueBot NG | +--------------+
- *Why is there no bot flag in the checkuser table? *A lot of the other
fields seem to be copied from the recentchanges table, why not the rc_bot field? The check user table contains both entries that are flagged as bots in the recentchanges table and entries that are flagged as bots in the user_groups.
mysql> select cuc.cuc_user_text from recentchanges rc join cu_changes cuc ON (rc.rc_user=cuc.cuc_user) WHERE rc.rc_bot=1 limit 1; +---------------+ | cuc_user_text | +---------------+ | MiszaBot III | +---------------+ mysql> select cuc.cuc_user_text from cu_changes cuc join user_groups ug ON (cuc.cuc_user=ug.ug_user) WHERE ug.ug_group = 'bot' limit 1; +---------------+ | cuc_user_text | +---------------+ | Robbot | +---------------+
- *Am I missing some fundamental information about how bots are handled?* This
is a frequently recurring request for data analytics and it seems the data is inconsistent.
What is the most convenient, sane way to classify bot activity as such? Are there any projects underway that aim to improve the situation? Any input, pointers and recommendations are much appreciated.
Thanks a lot! Regards, Fabian _______________________________________________ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l