Hi,
Sorry about the length of this mail, it reads faster than it looks.
I am working with the recentchanges and the cu_changes (checkuser) mediawiki SQL tables. I would like to be able to filter bot activity, unfortunately I am increasingly confused.
Things that I think I know:
- In the recentchangeshttp://www.mediawiki.org/wiki/Manual:Recentchanges_table table there is a `rc_bot` flag that should indicate whether the edit comes from a bot. - The checkuser table cu_changeshttp://www.mediawiki.org/wiki/Extension:CheckUser (which is not documented on the mediawiki database layout pagehttp://www.mediawiki.org/wiki/Manual:Database_layout) contains mostly the same information as the recentchanges table but for a longer period of time. However, there is no bot flag as there is on the recentchanges table - I don't know why not. - There is a `bot` entry in the user_groups.ug_grouphttp://www.mediawiki.org/wiki/Manual:User_groups_table field. A revision/recentchanges/cu_changes entry can be identified as bot by joining the original table with user_groups on the user_id and by setting ug_group=`bot`. - The user_groups method way of identifying bots is inefficient and the data seems incomplete. For some other projects we have used various other bot tables created by hand (on db1047: halfak.bot used during WSOR 2011 or declerambaul.erik_bots containing the bots identified by Erik Zachte).
I would like to know the answers to the following questions:
1. *What is the meaning/purpose of the rc_bot flag on recentchanges? *There are entries in the recentchanges table from editors that are flagged as bots in the user_groups and the other bot tables but still have the rc_bot flag set to 0.
mysql> select rc.rc_user_text from recentchanges rc join user_groups ug ON (rc.rc_user=ug.ug_user) WHERE ug.ug_group = 'bot' and rc.rc_bot=0 limit 1; +--------------+ | rc_user_text | +--------------+ | ClueBot NG | +--------------+
2. *Why is there no bot flag in the checkuser table? *A lot of the other fields seem to be copied from the recentchanges table, why not the rc_bot field? The check user table contains both entries that are flagged as bots in the recentchanges table and entries that are flagged as bots in the user_groups.
mysql> select cuc.cuc_user_text from recentchanges rc join cu_changes cuc ON (rc.rc_user=cuc.cuc_user) WHERE rc.rc_bot=1 limit 1; +---------------+ | cuc_user_text | +---------------+ | MiszaBot III | +---------------+ mysql> select cuc.cuc_user_text from cu_changes cuc join user_groups ug ON (cuc.cuc_user=ug.ug_user) WHERE ug.ug_group = 'bot' limit 1; +---------------+ | cuc_user_text | +---------------+ | Robbot | +---------------+
3. *Am I missing some fundamental information about how bots are handled?* This is a frequently recurring request for data analytics and it seems the data is inconsistent.
What is the most convenient, sane way to classify bot activity as such? Are there any projects underway that aim to improve the situation? Any input, pointers and recommendations are much appreciated.
Thanks a lot! Regards, Fabian
2012/5/25 Fabian Kaelin fkaelin@wikimedia.org:
- *What is the meaning/purpose of the rc_bot flag on recentchanges? *There
are entries in the recentchanges table from editors that are flagged as bots in the user_groups and the other bot tables but still have the rc_bot flag set to 0.
I don't know about the rest, but: * it is possible for any admin to do reverts which are considered bot changes (in order not to clutter the RecentChanges), just add "&bot=1" to the end of any URL where the pages contains rollback links, such as contribs or page history * I think that it is also possible for a bot to make its edit as not-bot, for changes which should be by default visible on RC – apparently ClueBot (en.wiki automatic vandalism revert bot) does this
The rc_bot flag just means that the edit is by default hidden on RC and can be shown using "show bots" link, nothing else.
-- Matma Rex
The bot flag is rather confusing since edits can be marked as bot edits or non-bot edits regardless of whether they are actually coming from a bot. The edit API accepts a 'bot' or 'notbot' parameter and I believe this is what actually gets recorded in the recentchanges table. Obviously this isn't very helpful for getting reliable analytics on bot edits :(
Ryan Kaldari
On 5/25/12 10:49 AM, Fabian Kaelin wrote:
Hi,
Sorry about the length of this mail, it reads faster than it looks.
I am working with the recentchanges and the cu_changes (checkuser) mediawiki SQL tables. I would like to be able to filter bot activity, unfortunately I am increasingly confused.
Things that I think I know:
- In the recentchanges<http://www.mediawiki.org/wiki/Manual:Recentchanges_table>
table there is a `rc_bot` flag that should indicate whether the edit comes from a bot. - The checkuser table cu_changeshttp://www.mediawiki.org/wiki/Extension:CheckUser (which is not documented on the mediawiki database layout pagehttp://www.mediawiki.org/wiki/Manual:Database_layout) contains mostly the same information as the recentchanges table but for a longer period of time. However, there is no bot flag as there is on the recentchanges table - I don't know why not. - There is a `bot` entry in the user_groups.ug_grouphttp://www.mediawiki.org/wiki/Manual:User_groups_table field. A revision/recentchanges/cu_changes entry can be identified as bot by joining the original table with user_groups on the user_id and by setting ug_group=`bot`. - The user_groups method way of identifying bots is inefficient and the data seems incomplete. For some other projects we have used various other bot tables created by hand (on db1047: halfak.bot used during WSOR 2011 or declerambaul.erik_bots containing the bots identified by Erik Zachte).
I would like to know the answers to the following questions:
- *What is the meaning/purpose of the rc_bot flag on recentchanges? *There
are entries in the recentchanges table from editors that are flagged as bots in the user_groups and the other bot tables but still have the rc_bot flag set to 0.
mysql> select rc.rc_user_text from recentchanges rc join user_groups ug ON (rc.rc_user=ug.ug_user) WHERE ug.ug_group = 'bot' and rc.rc_bot=0 limit 1; +--------------+ | rc_user_text | +--------------+ | ClueBot NG | +--------------+
- *Why is there no bot flag in the checkuser table? *A lot of the other
fields seem to be copied from the recentchanges table, why not the rc_bot field? The check user table contains both entries that are flagged as bots in the recentchanges table and entries that are flagged as bots in the user_groups.
mysql> select cuc.cuc_user_text from recentchanges rc join cu_changes cuc ON (rc.rc_user=cuc.cuc_user) WHERE rc.rc_bot=1 limit 1; +---------------+ | cuc_user_text | +---------------+ | MiszaBot III | +---------------+ mysql> select cuc.cuc_user_text from cu_changes cuc join user_groups ug ON (cuc.cuc_user=ug.ug_user) WHERE ug.ug_group = 'bot' limit 1; +---------------+ | cuc_user_text | +---------------+ | Robbot | +---------------+
- *Am I missing some fundamental information about how bots are handled?* This
is a frequently recurring request for data analytics and it seems the data is inconsistent.
What is the most convenient, sane way to classify bot activity as such? Are there any projects underway that aim to improve the situation? Any input, pointers and recommendations are much appreciated.
Thanks a lot! Regards, Fabian _______________________________________________ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
On Fri, May 25, 2012 at 11:40 AM, Ryan Kaldari rkaldari@wikimedia.org wrote:
The bot flag is rather confusing since edits can be marked as bot edits or non-bot edits regardless of whether they are actually coming from a bot. The edit API accepts a 'bot' or 'notbot' parameter and I believe this is what actually gets recorded in the recentchanges table. Obviously this isn't very helpful for getting reliable analytics on bot edits :(
This is not entirely true: you need to have the bot right in order to mark your edits as bot edits. Typically, only users in the bot group would have the bot right.
Roan
The cu_changes table is not at http://www.mediawiki.org/wiki/Manual:Database_layout%3E because it's provided by an extension.
The rc_bot marks if it's an edit marked as bot. As you have discovered, a bot can make edits not marked as bot (furthermore, as it is only stored in recentchanges table, whether an edit was originally marked as bot is lost after the entry expires, see bug 11181).
As you want to filter out bot activity, you will want joining with usergroups. As a bot account should not perform any "manual" activity, that's a pretty good indicator. OTOH, you may well discover bot operators ocasionally running a bot under their own account, and also bot accounts which weren't given a flag (eg. some patrol bots).
The presence of 'bot' (case insensitive) in the user name is a good hint to treat as bot, but again, there are a few humans with that substring, and a few bots without it.
Finally, are you using in any way the special fields of the checkuser table? As all that data is available in other tables, it may not be appropiate to use access to that restricted information for anything not requiring it.
Regards
There are some bots without flag, but usergroups is a good starting point. As far as i know, only a few counter vandalisms or notifications bots aren't marked as bots. Like Matma said, bots can have the flag and edit wothout having the edits marked has bot edits. For example, my counter vandalism bot reverts vandalisms without the bot flag, but warns users with it.
Search for "Bot" on summary can be also an indicator of an automatic edition, as far has i know, at least on major projects bots must have that indication on summary, but may also fetch false positives for totaly automatic editions, since some scripts act has a semi-automatic tool.
Alchimista
2012/5/25 Platonides Platonides@gmail.com
The cu_changes table is not at http://www.mediawiki.org/wiki/Manual:Database_layout%3E because it's provided by an extension.
The rc_bot marks if it's an edit marked as bot. As you have discovered, a bot can make edits not marked as bot (furthermore, as it is only stored in recentchanges table, whether an edit was originally marked as bot is lost after the entry expires, see bug 11181).
As you want to filter out bot activity, you will want joining with usergroups. As a bot account should not perform any "manual" activity, that's a pretty good indicator. OTOH, you may well discover bot operators ocasionally running a bot under their own account, and also bot accounts which weren't given a flag (eg. some patrol bots).
The presence of 'bot' (case insensitive) in the user name is a good hint to treat as bot, but again, there are a few humans with that substring, and a few bots without it.
Finally, are you using in any way the special fields of the checkuser table? As all that data is available in other tables, it may not be appropiate to use access to that restricted information for anything not requiring it.
Regards
Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Fabian, you have found one of my old problems. There is a request on Bugzilla to store bot edits in a table other than recent changes in order to be able to sort them out of page histories. As the others wrote, this is a very complicated question with few hope to get exact results, but please let me know if you have some acceptable solution.
2012/5/26 Platonides Platonides@gmail.com
As a bot account should not perform any "manual" activity, that's a pretty good indicator.
Even this is not 100% true. I sometimes use my bot for mass text replacements witihin one article as this is the cheapest way but this is intended to be a single edit while I am working on that page manually.
On Fri, 25 May 2012 22:27:20 -0700, Bináris wikiposta@gmail.com wrote:
Fabian, you have found one of my old problems. There is a request on Bugzilla to store bot edits in a table other than recent changes in order to be able to sort them out of page histories. As the others wrote, this is a very complicated question with few hope to get exact results, but please let me know if you have some acceptable solution.
Did you file a bug for that? I'd like to watch that myself.
2012/5/26 Platonides Platonides@gmail.com
As a bot account should not perform any "manual" activity, that's a pretty good indicator.
Even this is not 100% true. I sometimes use my bot for mass text replacements witihin one article as this is the cheapest way but this is intended to be a single edit while I am working on that page manually.
There is 3 different "bot" thingies you should know about, I'll briefly describe them each:
== The "bot" user right:
This is the right that grants the user the ability to perform an edit with a "bot" flag.
Facts: * Not all users with this right are "bots". * The flag can be toggled on a per-edit basis. Bot software will activate this flag. But an account can be used by humans and bot software simultaneously. Dedicated bot accounts will typically have all its edits bot-flagged, but other users may contribute regularly and also run a bot from time to time with their credentials[1].
== The "bot" user group:
This user group is available in MediaWiki by default to make it possible to grant a user the bot right (because user management goes by groups, not rights. To grant a user the bot right, one adds the user to a group that provides that right).
Facts: * Group membership can change over time. There are many bot-flagged edits by users that are no longer in a user group providing the bot right. Likewise there are many edits not bot-flagged by users that now have the bot right (which they may or may not use for each edit). * Not all bots are a member of this group (there are other groups that provide this right, sysop, for example).
== The "bot" flag:
This is the only reliable factor. This indicates most accurately that the edit was intended as bot edit (and that the user could do so because they had the bot user right when the edit was made).
It is especially reliable because the data is stored with the edit, not calculated afterwards (so it is regardless of the user's group memberships at time of query).
However it has one catch: The data is only stored in the recentchanges table, from which it expires after 30 days. I guess this explains why the best way is also the least common way to categorize bot edits in analytics (unless only covering recent data).
There is an open bug to store the bot flag in the revision table, thus making it permanently available[2].
-- Krinkle
[1] For example on Commons, where I am a sysop, there is a bot I ran that edits sysop-protected pages, therefore I had to run this bot under my personal account for a while, marking its edits as "bot". Most sysop-bots (including mine now) have a separate account which is then given membership to the "sysop" user group, but this isn't always the case. For example on Wikipedia I know there's various admins that use software to automatically block certain IP-addresses from time to time (proxies, TOR, zombies, whatever). Some are ran on bot accounts, some not.
[2] https://bugzilla.wikimedia.org/show_bug.cgi?id=17237
On May 25, 2012, at 7:49 PM, Fabian Kaelin fkaelin@wikimedia.org wrote:
Hi,
Sorry about the length of this mail, it reads faster than it looks.
I am working with the recentchanges and the cu_changes (checkuser) mediawiki SQL tables. I would like to be able to filter bot activity, unfortunately I am increasingly confused.
Things that I think I know:
- In the recentchangeshttp://www.mediawiki.org/wiki/Manual:Recentchanges_table
table there is a `rc_bot` flag that should indicate whether the edit comes from a bot.
- The checkuser table
cu_changeshttp://www.mediawiki.org/wiki/Extension:CheckUser (which is not documented on the mediawiki database layout pagehttp://www.mediawiki.org/wiki/Manual:Database_layout) contains mostly the same information as the recentchanges table but for a longer period of time. However, there is no bot flag as there is on the recentchanges table - I don't know why not.
- There is a `bot` entry in the
user_groups.ug_grouphttp://www.mediawiki.org/wiki/Manual:User_groups_table field. A revision/recentchanges/cu_changes entry can be identified as bot by joining the original table with user_groups on the user_id and by setting ug_group=`bot`.
- The user_groups method way of identifying bots is inefficient and the
data seems incomplete. For some other projects we have used various other bot tables created by hand (on db1047: halfak.bot used during WSOR 2011 or declerambaul.erik_bots containing the bots identified by Erik Zachte).
I would like to know the answers to the following questions:
- *What is the meaning/purpose of the rc_bot flag on recentchanges? *There
are entries in the recentchanges table from editors that are flagged as bots in the user_groups and the other bot tables but still have the rc_bot flag set to 0.
mysql> select rc.rc_user_text from recentchanges rc join user_groups ug ON (rc.rc_user=ug.ug_user) WHERE ug.ug_group = 'bot' and rc.rc_bot=0 limit 1; +--------------+ | rc_user_text | +--------------+ | ClueBot NG | +--------------+
- *Why is there no bot flag in the checkuser table? *A lot of the other
fields seem to be copied from the recentchanges table, why not the rc_bot field? The check user table contains both entries that are flagged as bots in the recentchanges table and entries that are flagged as bots in the user_groups.
mysql> select cuc.cuc_user_text from recentchanges rc join cu_changes cuc ON (rc.rc_user=cuc.cuc_user) WHERE rc.rc_bot=1 limit 1; +---------------+ | cuc_user_text | +---------------+ | MiszaBot III | +---------------+ mysql> select cuc.cuc_user_text from cu_changes cuc join user_groups ug ON (cuc.cuc_user=ug.ug_user) WHERE ug.ug_group = 'bot' limit 1; +---------------+ | cuc_user_text | +---------------+ | Robbot | +---------------+
- *Am I missing some fundamental information about how bots are handled?* This
is a frequently recurring request for data analytics and it seems the data is inconsistent.
What is the most convenient, sane way to classify bot activity as such? Are there any projects underway that aim to improve the situation? Any input, pointers and recommendations are much appreciated.
Thanks a lot! Regards, Fabian _______________________________________________ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
wikitech-l@lists.wikimedia.org