Re: [Mediawiki-l] jibberish

17 Oct 2007

On 16/10/2007, Michael Daly <michaeldaly(a)kayakwiki.org> wrote:
...
  2007(a)gmask.com wrote:

  This is what is happening to me as well, but the inserted words are
 always at the beginning of the page, which gives me hope of blocking
 these types of bot edits with a regex.

 I was thinking that this could be checked against a dictionary.  If the
 first "word" inserted is not in the dictionary (for the page's
 language), require the user to confirm the save.  A bot won't confirm.

 This would have to be smart enough to skip wikitext (e.g. don't worry
 about "[[Image:").  Similarly, it would choke on obscure acronyms, but a
 real person would not likely complain too much.

 This could be a hook into the "save" code and only need check for the
 first word.  However, the bot writer can switch to posting at the end of
 the article...  Possibly, a scan of the entire page to reject
 exceptionally bad spelling might suffice, but will put off some
 contributors (and annoy US vs Canadian vs British spellers if the bad
 spelling algorithm isn't smart enough to think honour vs honor isn't
 that bad). 

So: 1) We are all seeing the same kind of spam. 2) We need something
that looks at the whole edit and isn't based on some trivial aspect
of the particular spam attack (which could easily be changed). 3) We
need something that goes beyond an 'are you a human' captcha, because
such tests are either too infrequent to be useful or too common to be
tenable.

4) What is wrong with a Bayesian (email style) spam filter?

Each edit gets certain attributes set - username and email or IP
address, number of good edits from this user, edit frequency of this
user, edit diff text, etc. - and then the Bayesian filter flags the
edit with a 'level of spamminess'. Depending on configuration, spammy
edits can be rejected outright, with repeated spam leading to
automatic bans, or potential spam can be queued in a special list of
edits for review (the review process being key to learning the
patterns of spam). Such a filter could equally be applied to
vandalism... Also (while I am at it), sysops would have the option to
'mark edit as spam', providing more data for the training algorithm.
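
To make the idea concrete, here is a rough sketch of the scoring step.
It is written in Python rather than PHP purely for illustration, it is
not taken from either of the packages linked below, and it only looks
at the edit diff text (the other attributes, such as IP address or edit
frequency, would need to be turned into extra tokens). The class name,
the triage() helper and its thresholds are all made up for this
example.

```python
import math
import re
from collections import Counter


class EditSpamFilter:
    """Tiny naive Bayes classifier over the words added by an edit."""

    def __init__(self):
        self.words = {"spam": Counter(), "ham": Counter()}
        self.edits = {"spam": 0, "ham": 0}

    @staticmethod
    def tokens(text):
        return re.findall(r"[a-z0-9']+", text.lower())

    def train(self, diff_text, label):
        """label is 'spam' or 'ham', e.g. from a sysop's 'mark edit as spam'."""
        self.words[label].update(self.tokens(diff_text))
        self.edits[label] += 1

    def spamminess(self, diff_text):
        """Return the estimated probability (0..1) that the edit is spam."""
        vocab = len(set(self.words["spam"]) | set(self.words["ham"]))
        total = self.edits["spam"] + self.edits["ham"]
        toks = self.tokens(diff_text)
        scores = {}
        for label in ("spam", "ham"):
            # Class prior and word likelihoods, both with add-one smoothing.
            score = math.log((self.edits[label] + 1) / (total + 2))
            n = sum(self.words[label].values())
            for tok in toks:
                score += math.log((self.words[label][tok] + 1) / (n + vocab + 1))
            scores[label] = score
        # Turn the two log scores into a probability; clamp to avoid overflow.
        diff = max(-700.0, min(700.0, scores["ham"] - scores["spam"]))
        return 1.0 / (1.0 + math.exp(diff))


def triage(p, reject_at=0.95, review_at=0.6):
    """Map a spamminess score to an action. Thresholds are arbitrary guesses."""
    if p >= reject_at:
        return "reject"
    if p >= review_at:
        return "review"
    return "accept"
```

Every reviewed edit would be fed back through train(), so the review
queue and the 'mark edit as spam' button both double as training data.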

So there is only one problem... Where should we start?

Some Googling for PHP code to nick looks promising...

http://www.phpclasses.org/browse/file/9319.html Guestbook Example with
SpamFilter
http://www.squirrelmail.org/plugin_view.php?id=115 uses a Bayesian
algorithm to determine what you consider to be spam.
