Mark Williamson wrote:
Blocking "that-medicine-that-starts-with-c" will prevent anyone writing about socialism (which was rather a problem for socialism.wikicities.com) :)
See-eye-a-ell-eye-ess is related to _socialism_??
At the risk of ending up in everyone's spam bins, I'll spell it out: "so...Cialis...m". Blocking the word blocks any words that contain it.
Ohh. Duh. But, surely, it would take only a few lines of code to add a feature so that it only blocked the _whole word_?
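For what it's worth, whole-word blocking really is only a few lines. The following is a minimal sketch in Python, not the filter the wiki actually uses; `BLOCKED_WORDS` and `contains_blocked_word` are names made up for illustration.

```python
import re

# Hypothetical blocklist, for illustration only.
BLOCKED_WORDS = ["cialis"]

# \b only matches at a word boundary, so a blocked word embedded in a
# longer word (as in "socialism") no longer triggers the filter.
BLOCKED_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(word) for word in BLOCKED_WORDS) + r")\b",
    re.IGNORECASE,
)


def contains_blocked_word(text):
    """Return True if text contains a blocked word as a whole word."""
    return BLOCKED_RE.search(text) is not None
```

Note that Python treats the underscore as a word character, so a variant like __cialis__ is not caught by this pattern, and neither are misspellings such as CCialiss.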
This kind of spam filtering doesn't really work. Spammers will write CCialiss, __cialis__, "C1al1s", etc. To fight spam properly, one needs a Bayesian spam filter. If edits get flagged as spam or non-spam, a database can be built up against which new edits can be compared. Each new edit then gets a 'spam chance' P_s, and we could define a threshold P_t such that an edit with P_s > P_t is prevented from getting through. The regular expression 'c[i1][a@][l1][i1]s' has 14 hits in my hammie.db database (built with Spambayes for Bayesian spam filtering), and I'm sure I've missed some.
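To make the P_s > P_t idea concrete, here is a rough sketch in Python. It is a plain naive Bayes classifier with add-one smoothing, which is simpler than what Spambayes really does internally; the class `EditClassifier`, the function `is_blocked` and the value chosen for `P_T` are all assumptions made for the example.

```python
import math
import re
from collections import Counter

TOKEN_RE = re.compile(r"[a-z0-9@_]+")

# Hypothetical threshold P_t; a real value would need tuning.
P_T = 0.9


def tokenize(text):
    """Split text into lower-case tokens."""
    return TOKEN_RE.findall(text.lower())


class EditClassifier:
    """Naive Bayes over token presence (not Spambayes's own algorithm)."""

    def __init__(self):
        self.spam_tokens = Counter()
        self.ham_tokens = Counter()
        self.spam_count = 0
        self.ham_count = 0

    def train(self, text, is_spam):
        """Record one edit that a human flagged as spam or non-spam."""
        tokens = set(tokenize(text))
        if is_spam:
            self.spam_tokens.update(tokens)
            self.spam_count += 1
        else:
            self.ham_tokens.update(tokens)
            self.ham_count += 1

    def spam_probability(self, text):
        """Return P_s, the estimated chance that text is spam."""
        # Start from the prior odds, then add one log-ratio per token.
        log_odds = math.log((self.spam_count + 1) / (self.ham_count + 1))
        for token in set(tokenize(text)):
            p_spam = (self.spam_tokens[token] + 1) / (self.spam_count + 2)
            p_ham = (self.ham_tokens[token] + 1) / (self.ham_count + 2)
            log_odds += math.log(p_spam / p_ham)
        # Clamp so that math.exp cannot overflow on extreme scores.
        log_odds = max(-700.0, min(700.0, log_odds))
        return 1.0 / (1.0 + math.exp(-log_odds))


def is_blocked(classifier, text):
    """An edit is refused when P_s > P_t."""
    return classifier.spam_probability(text) > P_T
```

After training on a handful of flagged edits, `is_blocked(classifier, new_text)` gives the yes/no decision, while `spam_probability` gives the underlying P_s for anyone who wants to inspect borderline cases.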
I mean, does anybody get spam e-mails that say "Free socialism! Click here now"... or even "Get free soCialiSm! cl**k here no*" or anything like that? I don't think spammers are sophisticated enough to realise that there are legitimate words that contain spam-filter'd words.
No, but they do replace letters with other characters or insert spaces between them.
Of course, anything that filtered on something as complex as this would be very, very complex programming.
Not really.
Perhaps instead, somebody could adapt a Free numerical rating system for spam e-mails (which gives "likelihoods" that e-mails are spam) -- Google may or may not be willing to help out there given how massive their database must be and their commitment to Goodness on the Internet, but if not there would be another project I'm sure.
The good thing about Bayesian spam filtering is that the database is tailored to one's own needs, and the accuracy grows very quickly as the database gets larger.
I'm going into too much detail here, and obviously it would be a massive undertaking, but given the massive amount of work it would solve, it's not the sort of pipe dream that I feel guilty bringing up in front of people who could actually bring it to fruition (I know I couldn't without learning a programming language first -- right now, I have very rusty Qbasic, medium-to-advanced HTML, a bit of UNL, but nothing else, and the latter two aren't exactly programming languages).
Let us have a look at the components of Spambayes. Those can certainly be used and adapted to our task. As tokens we can use IP addresses, numbers indicating the amount of code removed (needs some more thinking), negative points when text is removed and positive points when it is added (e.g. *removing* 'cialis' has the opposite effect of *adding* it), etc.
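As a first idea of what such tokens could look like, here is a small sketch in Python. Everything in it is an assumption for the sake of discussion: the function name `edit_tokens`, the "ip:", "added:", "removed:" and "size:" prefixes, and the crude three-way size bucket (the size token is exactly the part that needs more thinking).

```python
import re
from collections import Counter

WORD_RE = re.compile(r"\w+")


def edit_tokens(old_text, new_text, ip_address):
    """Turn one edit into a list of tokens for a Bayesian classifier."""
    old_words = Counter(WORD_RE.findall(old_text.lower()))
    new_words = Counter(WORD_RE.findall(new_text.lower()))

    tokens = ["ip:" + ip_address]

    # Counter subtraction keeps only positive counts, so these are the
    # words that occur more often after (or before) the edit.
    for word in sorted(new_words - old_words):
        tokens.append("added:" + word)
    for word in sorted(old_words - new_words):
        tokens.append("removed:" + word)

    # Very rough stand-in for "amount of text removed or added".
    delta = len(new_text) - len(old_text)
    if delta > 0:
        tokens.append("size:grew")
    elif delta < 0:
        tokens.append("size:shrunk")
    else:
        tokens.append("size:same")
    return tokens
```

Because "added:cialis" and "removed:cialis" are different tokens, the classifier can learn that the first is a sign of spam while the second is a sign of someone cleaning it up.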
I can help with adapting Spambayes or using Spambayes components for our needs. I am not an expert, but I know a bit.
Gerrit.