The bot flag is rather confusing since edits can be marked as bot edits
or non-bot edits regardless of whether they are actually coming from a
bot. The edit API accepts a 'bot' or 'notbot' parameter and I believe
this is what actually gets recorded in the recentchanges table.
Obviously this isn't very helpful for getting reliable analytics on bot
edits :(
Ryan Kaldari
On 5/25/12 10:49 AM, Fabian Kaelin wrote:
Hi,
Sorry about the length of this mail, it reads faster than it looks.
I am working with the recentchanges and the cu_changes (checkuser)
mediawiki SQL tables. I would like to be able to filter bot activity,
unfortunately I am increasingly confused.
Things that I think I know:
- In the
recentchanges<http://www.mediawiki.org/wiki/Manual:Recentchanges_table&g…
table
there is a `rc_bot` flag that should indicate whether the edit comes from a
bot.
- The checkuser table
cu_changes<http://www.mediawiki.org/wiki/Extension:CheckUser> (which
is not documented on the mediawiki database layout
page<http://www.mediawiki.org/wiki/Manual:Database_layout>)
contains mostly the same information as the recentchanges table but for a
longer period of time. However, there is no bot flag as there is on the
recentchanges table - I don't know why not.
- There is a `bot` entry in the
user_groups.ug_group<http://www.mediawiki.org/wiki/Manual:User_groups_ta…
field.
A revision/recentchanges/cu_changes entry can be identified as bot by
joining the original table with user_groups on the user_id and by setting
ug_group=`bot`.
- The user_groups method way of identifying bots is inefficient and the
data seems incomplete. For some other projects we have used various other
bot tables created by hand (on db1047: halfak.bot used during WSOR 2011 or
declerambaul.erik_bots containing the bots identified by Erik Zachte).
I would like to know the answers to the following questions:
1. *What is the meaning/purpose of the rc_bot flag on recentchanges? *There
are entries in the recentchanges table from editors that are flagged as
bots in the user_groups and the other bot tables but still have the rc_bot
flag set to 0.
mysql> select rc.rc_user_text from recentchanges rc join user_groups ug ON
(rc.rc_user=ug.ug_user) WHERE ug.ug_group = 'bot' and rc.rc_bot=0 limit 1;
+--------------+
| rc_user_text |
+--------------+
| ClueBot NG |
+--------------+
2. *Why is there no bot flag in the checkuser table? *A lot of the other
fields seem to be copied from the recentchanges table, why not the rc_bot
field? The check user table contains both entries that are flagged as bots
in the recentchanges table and entries that are flagged as bots in the
user_groups.
mysql> select cuc.cuc_user_text from recentchanges rc join cu_changes cuc
ON (rc.rc_user=cuc.cuc_user) WHERE rc.rc_bot=1 limit 1;
+---------------+
| cuc_user_text |
+---------------+
| MiszaBot III |
+---------------+
mysql> select cuc.cuc_user_text from cu_changes cuc join user_groups ug ON
(cuc.cuc_user=ug.ug_user) WHERE ug.ug_group = 'bot' limit 1;
+---------------+
| cuc_user_text |
+---------------+
| Robbot |
+---------------+
3. *Am I missing some fundamental information about how bots are handled?* This
is a frequently recurring request for data analytics and it seems the data
is inconsistent.
What is the most convenient, sane way to classify bot activity as such? Are
there any projects underway that aim to improve the situation? Any input,
pointers and recommendations are much appreciated.
Thanks a lot! Regards,
Fabian
_______________________________________________
Wikitech-l mailing list
Wikitech-l(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l