On 16/10/2007, Michael Daly michaeldaly@kayakwiki.org wrote:
2007@gmask.com wrote:
This is what is happening to me as well, but the inserted words are always at the beginning of the page, which gives me hope of blocking these types of bot edits with a regex.
I was thinking that this could be checked against a dictionary. If the first "word" inserted is not in the dictionary (for the page's language), require the user to confirm the save. A bot won't confirm.
This would have to be smart enough to skip wikitext (e.g. don't worry about "[[Image:"). Similarly, it would choke on obscure acronyms, but a real person would not likely complain too much.
This could be a hook into the "save" code and need only check the first word. However, the bot writer can switch to posting at the end of the article... Possibly, a scan of the entire page to reject exceptionally bad spelling might suffice, but it will put off some contributors (and annoy US vs Canadian vs British spellers if the bad-spelling algorithm isn't smart enough to realize that honour vs honor isn't that bad).
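To make the first-word idea concrete, here is a rough sketch in Python rather than PHP, just to show the logic. Everything in it is an assumption for illustration: the tiny DICTIONARY set stands in for a real per-language word list, strip_wikitext() only handles a few common constructs, and needs_confirmation() is a made-up name, not an existing MediaWiki hook.

```python
import difflib
import re

# Hypothetical stand-in for a real dictionary of the page's language.
DICTIONARY = {"the", "kayak", "is", "a", "small", "boat", "honour", "honor"}


def strip_wikitext(text):
    """Remove a few common wikitext constructs so markup is not read as words."""
    text = re.sub(r"\{\{.*?\}\}", " ", text, flags=re.S)            # templates
    text = re.sub(r"\[\[[^\]|]*:[^\]]*\]\]", " ", text)             # [[Image:...]] etc.
    text = re.sub(r"\[\[(?:[^\]|]*\|)?([^\]]*)\]\]", r"\1", text)   # keep link label
    text = re.sub(r"<[^>]+>", " ", text)                            # HTML-style tags
    text = re.sub(r"'{2,}|={2,}", " ", text)                        # bold/italic, headings
    return text


def words(text):
    """Lower-cased alphabetic words of the text, with wikitext stripped."""
    return re.findall(r"[^\W\d_]+", strip_wikitext(text).lower())


def first_inserted_word(old_text, new_text):
    """Return the first word the edit adds, or None if it adds no words."""
    old_words, new_words = words(old_text), words(new_text)
    matcher = difflib.SequenceMatcher(a=old_words, b=new_words, autojunk=False)
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        if tag in ("insert", "replace") and j2 > j1:
            return new_words[j1]
    return None


def needs_confirmation(old_text, new_text, dictionary=DICTIONARY):
    """True if the first inserted word is not in the dictionary."""
    word = first_inserted_word(old_text, new_text)
    return word is not None and word not in dictionary
```

An edit that only adds "[[Image:Kayak.jpg]]" passes straight through, while one that prepends gibberish gets the confirmation prompt. It shares the weakness noted above: an obscure acronym as the first new word would also trigger the prompt.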
So:
1) We are all seeing the same kind of spam.
2) We need something that looks at the whole edit, and isn't based on some trivial aspect of the particular spam attack (that could easily be changed).
3) We need something that goes beyond an 'are you a human' captcha, because such tests are either too infrequent to be useful or too common to be tenable.
4) What is wrong with a Bayesian (email style) spam filter?
Each edit gets certain attributes set - username and email or IP address, number of good edits from this user, edit frequency of this user, edit diff text, etc. - and then the Bayesian filter flags the edit with a 'level of spamminess'. Depending on configuration, spammy edits can be rejected flat out, with multiple spams leading to automatic bans. Or potential spam can be queued in a special list of edits for review (the review process being key to learning the patterns of spam). Such a filter could equally be applied to vandalism... Also (while I am at it) sysops would have the option to 'mark edit as spam', providing more data for the training algorithm.
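For anyone who hasn't looked inside an email-style filter, here is a minimal naive Bayes sketch in Python (not PHP, and not taken from any existing extension). It scores only the edit diff text; the other attributes listed above (user, IP, edit history) are left out for brevity. The class name EditSpamFilter, its methods and the training strings are all made up for illustration.

```python
import math
import re
from collections import Counter


class EditSpamFilter:
    """Naive Bayes over the words of an edit diff, with add-one smoothing."""

    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}  # word counts per class
        self.edits = {"spam": 0, "ham": 0}                   # edits seen per class

    @staticmethod
    def tokens(diff_text):
        return re.findall(r"[a-z0-9']+", diff_text.lower())

    def train(self, diff_text, label):
        """Record a reviewed edit; label is 'spam' or 'ham'."""
        self.counts[label].update(self.tokens(diff_text))
        self.edits[label] += 1

    def spamminess(self, diff_text):
        """Return an estimated probability (0..1) that the edit is spam."""
        # Vocabulary size, plus one slot reserved for never-seen words.
        vocab = len(set(self.counts["spam"]) | set(self.counts["ham"])) + 1
        spam_total = sum(self.counts["spam"].values())
        ham_total = sum(self.counts["ham"].values())
        # Smoothed prior log-odds from the number of edits in each class.
        log_odds = math.log((self.edits["spam"] + 1) / (self.edits["ham"] + 1))
        for tok in self.tokens(diff_text):
            p_spam = (self.counts["spam"][tok] + 1) / (spam_total + vocab)
            p_ham = (self.counts["ham"][tok] + 1) / (ham_total + vocab)
            log_odds += math.log(p_spam / p_ham)
        # Clamp so exp() cannot overflow on very long edits.
        log_odds = max(-50.0, min(50.0, log_odds))
        return 1.0 / (1.0 + math.exp(-log_odds))


f = EditSpamFilter()
f.train("cheap pills buy viagra online", "spam")
f.train("buy cheap casino pills", "spam")
f.train("added launch sites for the river section", "ham")
f.train("fixed spelling of paddle in the history section", "ham")
print(f.spamminess("buy cheap pills"))
print(f.spamminess("river section launch sites"))
```

The 'mark edit as spam' button would simply call train() with the "spam" label, and the reject/queue thresholds on the score would be configuration.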
So there is only one problem... Where should we start?
Some Googling for PHP code to nick looks promising...
http://www.phpclasses.org/browse/file/9319.html
  Guestbook Example with SpamFilter
http://www.squirrelmail.org/plugin_view.php?id=115
  Uses a Bayesian algorithm to determine what you consider to be spam.