Spamfilters (was: Re: [Wikipedia-l] Methods of protection of minor Wikipedias)

5 Nov 2005

Mark Williamson wrote:
...
 Blocking "that-medicine-that-starts-with-c" will
 prevent anyone writing about socialism (which was rather a problem for
 socialism.wikicities.com) :) 
 See-eye-a-ell-eye-ess is related to _socialism_?? 
 At the risk of ending up in everyone's spam bins, I'll spell it out:
 "so...Cialis...m". Blocking the word blocks any words that contain it.  
 Ohh. Duh. But, surely, it would take only a few lines of code to add a
 feature so that it only blocked the _whole word_? 
This kind of spam filtering doesn't really work. Spammers will write
CCialiss, __cialis__, "C1al1s", etc. To fight spam properly one needs a
Bayesian spam filter. If edits get flagged as spam or non-spam, a
database can be built up against which new edits can be compared.
Each new edit then gets a 'spam chance' P_s, and we could define a
threshold P_t such that an edit with P_s > P_t does not get through.
The regular expression 'c[i1][a@][l1][i1]s' has 14 hits in my hammie.db
database for Bayesian spam filtering using Spambayes, and I'm sure I've
missed some.
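
The P_s versus P_t idea can be sketched in a few lines of Python. This is
only a minimal naive Bayes illustration, not the scoring method Spambayes
itself uses; the class name EditClassifier, the tokenizer, and the
threshold value 0.9 are all assumptions made for the example.

```python
import math
import re
from collections import Counter

P_T = 0.9  # assumed threshold P_t; the post does not give a value


def tokenize(text):
    """Crude lower-case tokenizer, a stand-in for a real one."""
    return re.findall(r"[a-z0-9@_*]+", text.lower())


class EditClassifier:
    """Hypothetical naive Bayes classifier over edits flagged spam/non-spam."""

    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.totals = {"spam": 0, "ham": 0}  # trained edits per class

    def train(self, text, is_spam):
        label = "spam" if is_spam else "ham"
        # Count each token once per edit (presence, not frequency).
        self.counts[label].update(set(tokenize(text)))
        self.totals[label] += 1

    def spam_chance(self, text):
        """Return P_s, the estimated probability that the edit is spam."""
        # Start from the smoothed prior odds, then add per-token evidence.
        log_odds = math.log((self.totals["spam"] + 1) / (self.totals["ham"] + 1))
        for tok in set(tokenize(text)):
            p_spam = (self.counts["spam"][tok] + 1) / (self.totals["spam"] + 2)
            p_ham = (self.counts["ham"][tok] + 1) / (self.totals["ham"] + 2)
            log_odds += math.log(p_spam / p_ham)
        # Clamp so math.exp cannot overflow on extreme scores.
        log_odds = max(-700.0, min(700.0, log_odds))
        return 1.0 / (1.0 + math.exp(-log_odds))

    def is_blocked(self, text):
        """An edit is rejected when P_s > P_t."""
        return self.spam_chance(text) > P_T
```

After training on a handful of flagged edits, is_blocked() gives the
yes/no decision and spam_chance() the underlying score, so the threshold
can be tuned without retraining.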

...
 I mean, does anybody get spam e-mails that say "Free socialism! Click
 here now"... or even "Get free soCialiSm! cl**k here no*" or anything
 like that? I don't think spammers are sophisticated enough to realise
 that there are legitimate words that contain spam-filter'd words. 
No, but they do replace letters with other characters or insert spaces
between them.
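
To make the trade-offs discussed so far concrete, here is a small Python
sketch comparing three word filters: plain substring matching, whole-word
matching, and the character-class pattern quoted earlier in this thread.
The names SUBSTRING, WHOLE_WORD, OBFUSCATED and hits() are invented for
the illustration.

```python
import re

# Plain substring match: also fires inside legitimate words.
SUBSTRING = re.compile(r"cialis", re.IGNORECASE)

# Whole-word match: \b requires a word boundary on both sides.
WHOLE_WORD = re.compile(r"\bcialis\b", re.IGNORECASE)

# Pattern from the thread: tolerates digit and symbol substitutions.
OBFUSCATED = re.compile(r"c[i1][a@][l1][i1]s", re.IGNORECASE)


def hits(pattern, text):
    """Return True if the pattern matches anywhere in the text."""
    return bool(pattern.search(text))


if __name__ == "__main__":
    samples = ["socialism", "cialis", "C1al1s", "__cialis__", "CCialiss", "c i a l i s"]
    for sample in samples:
        print(
            sample,
            hits(SUBSTRING, sample),
            hits(WHOLE_WORD, sample),
            hits(OBFUSCATED, sample),
        )
```

The whole-word pattern stops blocking "socialism" but misses "C1al1s",
"__cialis__" and "CCialiss"; the character-class pattern catches "C1al1s"
but still matches inside "socialism" and misses the spaced-out form. No
single pattern covers every case.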

...
 Of course, anything that filtered on something as complex as this
 would be very, very complex programming. 
Not really.

...
 Perhaps instead, somebody could adapt a Free numerical rating system
 for spam e-mails (which gives "likelihoods" that e-mails are spam) --
 Google may or may not be willing to help out there given how massive
 their database must be and their commitment to Goodness on the
 Internet, but if not there would be another project I'm sure. 
The good thing about Bayesian spam filtering is that the database is
tailored to one's own needs, and the accuracy grows very quickly as the
database gets larger.

...
 I'm going into too much detail here, and obviously it would be a
 massive undertaking, but given the massive amount of work it would
 solve, it's not the sort of pipe dream that I feel guilty bringing up
 in front of people who could actually bring it to fruition (I know I
 couldn't without learning a programming language first -- right now, I
 have very rusty Qbasic, medium-to-advanced HTML, a bit of UNL, but
 nothing else, and the latter two aren't exactly programming
 languages). 
Let us have a look at the components of Spambayes. Those can certainly
be used and adapted to our task. As tokens we can use IP addresses,
numbers indicating the amount of code removed (needs some more
thinking), negative points when text is removed and positive points when
it's added (e.g. *removing* 'cialis' has the opposite effect of *adding*
it), etc.
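
A rough sketch of how such tokens might be derived from an edit follows.
Everything in it is an assumption for illustration: the function names
edit_tokens() and size_bucket(), the "ip:", "size:", "added:" and
"removed:" prefixes, and the 500-character bucket boundary are invented
here and are not part of Spambayes. The size token uses the change in
text length as a stand-in for the "amount removed" idea.

```python
import re
from collections import Counter

WORD = re.compile(r"\w+")


def size_bucket(delta):
    """Map a change in length to a coarse label (boundaries are arbitrary)."""
    if delta < -500:
        return "shrunk-a-lot"
    if delta < 0:
        return "shrunk"
    if delta == 0:
        return "same"
    if delta <= 500:
        return "grew"
    return "grew-a-lot"


def edit_tokens(ip, old_text, new_text):
    """Turn one edit into a list of tokens for a Bayesian classifier."""
    old_words = Counter(WORD.findall(old_text.lower()))
    new_words = Counter(WORD.findall(new_text.lower()))

    tokens = ["ip:" + ip]
    tokens.append("size:" + size_bucket(len(new_text) - len(old_text)))

    # Counter subtraction keeps only words whose count went up (or down),
    # so adding 'cialis' and removing 'cialis' give distinct tokens.
    for word in sorted(new_words - old_words):
        tokens.append("added:" + word)
    for word in sorted(old_words - new_words):
        tokens.append("removed:" + word)
    return tokens
```

Because added and removed words carry different prefixes, a classifier
trained on these tokens can learn that removing 'cialis' from a page
points the opposite way from adding it.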

I can help with adapting Spambayes or using Spambayes components for our
needs. I am not an expert, but I know a bit.

Gerrit.

-- 
Temperature in Luleå, Norrbotten, Sweden:
| Current temperature   05-11-05 09:39:55    8.1 degrees Celsius ( 46.7F) |
-- 
There is no bad weather, only bad clothing.
