On 16/10/2007, Michael Daly <michaeldaly(a)kayakwiki.org> wrote:
2007(a)gmask.com wrote:
This is what is happening to me as well, but the inserted words are
always at the beginning of the page, which gives me hope of blocking
these types of bot edits with a regex.
I was thinking that this could be checked against a dictionary. If the
first "word" inserted is not in the dictionary (for the page's
language), require the user to confirm the save. A bot won't confirm.
This would have to be smart enough to skip wikitext (e.g. don't worry
about "[[Image:"). Similarly, it would choke on obscure acronyms, but a
real person would not likely complain too much.
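
A minimal sketch of the first-word check described above, written in
Python for brevity (a real MediaWiki hook would be PHP). The word list,
the wikitext patterns and the function names are all hypothetical
stand-ins, and the wikitext skipping is far from exhaustive:

```python
import re

# Hypothetical stand-in for a real word list in the page's language.
DICTIONARY = {"this", "the", "kayak", "a", "is", "paddle", "history"}

# Wikitext constructs to blank out before looking for the first word.
# Assumption: this short list is illustrative only; nested links and
# tables are not handled.
WIKITEXT = re.compile(
    r"\[\[[^\]]*\]\]"   # [[Image:...]] and other internal links
    r"|\{\{[^}]*\}\}"   # {{templates}}
    r"|<[^>]+>"         # HTML-style tags
    r"|'{2,}"           # bold/italic quote runs
    r"|^[=*#:;]+",      # heading, list and indent markers at line start
    re.MULTILINE,
)


def first_word(text):
    """Return the first alphabetic word after skipping wikitext, or None."""
    stripped = WIKITEXT.sub(" ", text)
    match = re.search(r"[A-Za-z]+", stripped)
    return match.group(0).lower() if match else None


def needs_confirmation(inserted_text, dictionary=DICTIONARY):
    """True when the first real word is unknown, so the save should be confirmed."""
    word = first_word(inserted_text)
    return word is not None and word not in dictionary
```

With this word list, needs_confirmation("[[Image:Boat.jpg]] The kayak")
is false, while an edit starting with a random string such as "xqzvbn"
would ask the user to confirm. An obscure acronym would trigger the
prompt as well, which is the accepted cost mentioned above.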
This could be a hook into the "save" code and would only need to check
the first word. However, the bot writer can switch to posting at the end
of the article... Possibly, a scan of the entire page to reject
exceptionally bad spelling might suffice, but it will put off some
contributors (and annoy US vs Canadian vs British spellers if the
bad-spelling algorithm isn't smart enough to accept both honour and
honor).
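
The whole-page scan could be sketched roughly as below (again Python,
for illustration only). The word list, the variant map and the 0.5
threshold are assumptions made up for the example; the variant map is
one simple way to keep honour vs honor from counting as a misspelling:

```python
import re

# Hypothetical word list using US spellings.
DICTIONARY = {"the", "of", "is", "a", "matter", "honor", "color", "paddle"}

# Hypothetical map from regional variants to the spelling in DICTIONARY.
VARIANTS = {"honour": "honor", "colour": "color"}


def misspelled_ratio(text, dictionary=DICTIONARY, variants=VARIANTS):
    """Fraction of words that are unknown even after variant mapping."""
    words = [w.lower() for w in re.findall(r"[A-Za-z]+", text)]
    if not words:
        return 0.0
    unknown = sum(1 for w in words if variants.get(w, w) not in dictionary)
    return unknown / len(words)


def looks_like_gibberish(text, threshold=0.5):
    """Assumed policy: flag a page when more than half its words are unknown."""
    return misspelled_ratio(text) > threshold
```

A page would only be flagged when most of its words are unknown, so the
occasional typo or regional spelling does not bother real contributors.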
So: 1) We are all seeing the same kind of spam. 2) We need something
that looks at the whole edit and isn't based on some trivial aspect
of the particular spam attack (that could easily be changed). 3) We
need something that goes beyond an 'are you a human' captcha, because
such tests are either too infrequent to be useful or too common to be
tenable.
4) What is wrong with a Bayesian (email style) spam filter?
Each edit gets certain attributes set (username and email or IP
address, number of good edits from this user, edit frequency of this
user, edit diff text, etc.) and then the Bayesian filter flags the
edit with a 'level of spamminess'. Depending on configuration, spammy
edits can be rejected outright, with multiple spams leading to
automatic bans. Or potential spam can be queued in a special list of
edits for review (the review process being key to learning the
patterns of spam). Such a filter could equally be applied to
vandalism... Also (while I am at it) sysops would have the option to
'mark edit as spam', providing more data for the training algorithm.
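
To make the idea concrete, here is a rough naive Bayes sketch in Python
(a MediaWiki extension would be PHP). The class name, the choice of
attributes, the "user:new"/"user:known" pseudo-tokens and the
thresholds are all assumptions for illustration, not an existing API:

```python
import math
import re
from collections import Counter


class EditSpamFilter:
    """Naive Bayes over diff words plus one pseudo-token for user history."""

    def __init__(self):
        self.token_counts = {"spam": Counter(), "ham": Counter()}
        self.edit_counts = {"spam": 0, "ham": 0}

    @staticmethod
    def tokens(diff_text, user_good_edits=0):
        toks = re.findall(r"[a-z0-9']+", diff_text.lower())
        # Assumption: a non-text attribute is folded in as a pseudo-token.
        toks.append("user:new" if user_good_edits == 0 else "user:known")
        return toks

    def train(self, label, diff_text, user_good_edits=0):
        """label is 'spam' or 'ham', e.g. from a sysop's 'mark edit as spam'."""
        self.token_counts[label].update(self.tokens(diff_text, user_good_edits))
        self.edit_counts[label] += 1

    def spamminess(self, diff_text, user_good_edits=0):
        """Return the estimated probability (0..1) that the edit is spam."""
        vocab = set(self.token_counts["spam"]) | set(self.token_counts["ham"])
        total_edits = sum(self.edit_counts.values())
        scores = {}
        for label in ("spam", "ham"):
            # Laplace-smoothed prior and per-token likelihoods, in log space.
            score = math.log((self.edit_counts[label] + 1) / (total_edits + 2))
            total_tokens = sum(self.token_counts[label].values())
            for tok in self.tokens(diff_text, user_good_edits):
                count = self.token_counts[label][tok]
                score += math.log((count + 1) / (total_tokens + len(vocab) + 1))
            scores[label] = score
        # Clamp the log-odds so math.exp cannot overflow.
        diff = max(min(scores["ham"] - scores["spam"], 700.0), -700.0)
        return 1.0 / (1.0 + math.exp(diff))

    def classify(self, diff_text, user_good_edits=0, reject_at=0.95, review_at=0.5):
        p = self.spamminess(diff_text, user_good_edits)
        if p >= reject_at:
            return "reject"   # flat-out rejection
        if p >= review_at:
            return "review"   # queue for sysop review
        return "accept"
```

Usage would look something like this: call train("spam", diff) whenever
a sysop marks an edit as spam, train("ham", diff, good_edits) for edits
that survive review, and classify(diff, good_edits) in the save hook.
An untrained filter scores everything at 0.5, so everything lands in
the review queue until it has seen some examples.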
So there is only one problem... Where should we start?
Some Googling for PHP code to nick looks promising...
http://www.phpclasses.org/browse/file/9319.html
  Guestbook Example with SpamFilter
http://www.squirrelmail.org/plugin_view.php?id=115
  uses a Bayesian algorithm to determine what you consider to be spam.