I've had a good idea for an anti-spam system for a while now. Blocks, captchas, local filters: all the tricks we've been using end up not working well enough to easily deal with the spam on a lot of wikis.
I know this because I've been continually dealing with the spam on a small, dead wiki. SimpleAntiSpam, AntiBot, captchas, TorBlock, AbuseFilter... time after time I expand my filters more and more, but inevitably a few days later spam not covered by them comes through and I have to do it all again.
I ended up having to deal with it again today, and so I finally wrote up the details of the machine-learning-based anti-spam system I've been thinking about for a while.
https://www.mediawiki.org/wiki/User:Dantman/Anti-spam_system
Of course, while I have the whole idea worked out for the UI, the backend, how to run the service, etc., I haven't actually done the machine-learning part before. And naturally, just like Gareth, OAuth, and other things, this is another one of those ideas I don't have the time and resources to build myself and wish I had the financial backing to work on.
Hi Daniel,
A lot of your ideas are covered by http://en.wikipedia.org/wiki/Wikipedia:STiki. Andrew has done a lot of great research; if you haven't read his papers yet, they might be a good intro to the kind of machine-learning approaches that have been used.
That being said, I would love to have a system that constantly learns from edits flagged as spam and that we could query from AbuseFilter with new edits to get a score for how likely each one is to be spam. If you get around to working on your system, it would be great to work out some way to interface with it.
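Just to sketch the kind of interface I'm imagining (everything here is hypothetical: the service URL, the JSON fields, the threshold; nothing like it exists yet), the wiki side could post the new edit to a scoring endpoint and act on the probability it gets back:

# Hypothetical sketch only: the scoring service, its URL and the JSON
# fields are made up for illustration.
import json
import urllib.request

def score_edit(page_title, edit_text,
               service_url="https://spam-scorer.example.org/score"):
    """Ask a (hypothetical) scoring service how spammy an edit looks."""
    payload = json.dumps({"title": page_title, "text": edit_text}).encode("utf-8")
    req = urllib.request.Request(service_url, data=payload,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["spam_probability"]  # 0.0 (clean) .. 1.0 (spam)

# An AbuseFilter rule could then tag, throttle or disallow above some
# threshold, e.g. disallow when score_edit(title, new_wikitext) > 0.9.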
Yeah, STiki and, more importantly, ClueBot NG are what I mean when I say "outside of Wikimedia (who already have bots for this)".
I looked into them a bit and planned to ask about looking at some of the code if I went ahead with the project.
On 17/08/12 04:16, Daniel Friesen wrote:
Of course, while I have the whole idea worked out for the UI, the backend, how to run the service, etc., I haven't actually done the machine-learning part before.
I would think that the actual machine learning stuff would be the hard part. I stopped using Thunderbird's Bayesian spam tagging feature years ago, when it started sorting emails from smart people in with the spam. The computer thought that the smart people were using long words with a similar frequency to the random dictionary words that padded out the spam messages.
I haven't worked with machine learning either, but I'm guessing it's not as simple as feeding a pre-tagged data set into a stock Bayesian filter library.
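For concreteness, the naive approach I have in mind is roughly this (a toy sketch with made-up training data, certainly not anyone's production filter):

# Toy naive Bayes spam scorer: word counts plus Bayes' rule with add-one
# smoothing. The four training "edits" are made up for illustration.
import math
from collections import Counter

training = [
    ("buy cheap pills online best price", "spam"),
    ("great deals cheap watches click here", "spam"),
    ("the architecture of the parser needs refactoring", "ham"),
    ("updated the documentation for the new hook", "ham"),
]

word_counts = {"spam": Counter(), "ham": Counter()}
label_counts = Counter()
for text, label in training:
    word_counts[label].update(text.split())
    label_counts[label] += 1

vocab = set(w for counter in word_counts.values() for w in counter)

def spam_probability(text):
    """P(spam | words) under a naive Bayes model."""
    log_scores = {}
    for label in ("spam", "ham"):
        logp = math.log(label_counts[label] / sum(label_counts.values()))
        denom = sum(word_counts[label].values()) + len(vocab)
        for w in text.split():
            logp += math.log((word_counts[label][w] + 1) / denom)
        log_scores[label] = logp
    m = max(log_scores.values())
    scores = {k: math.exp(v - m) for k, v in log_scores.items()}
    return scores["spam"] / (scores["spam"] + scores["ham"])

print(spam_probability("cheap pills click here"))       # close to 1
print(spam_probability("refactoring the parser docs"))  # close to 0

The failure mode I described above is exactly what this kind of model invites: whatever words happen to correlate with the spam side of the training set get treated as evidence of spam.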
-- Tim Starling
Yeah, Bayesian is probably too old to use. ClueBot NG appears to be using an artificial neural network [ANN] implementation to do its spam testing. From the documentation [ClueBot NG] it sounds like one of the trickier parts is understanding the wikitext well enough to extract the words and whatnot out of it.
[ANN] https://en.wikipedia.org/wiki/Artificial_neural_network
[ClueBot NG] https://en.wikipedia.org/wiki/User:ClueBot_NG
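To illustrate just that part (a rough sketch; the regexes and the feature list are my own guesses, not what ClueBot NG actually does), before any network sees an edit you have to boil the wikitext down to words and turn the diff into numbers:

# Rough sketch of turning a wikitext edit into features for a classifier.
# The regexes and the feature list are my own guesses, not ClueBot NG's.
import re

def strip_wikitext(text):
    """Very crude wikitext -> plain words; real parsing is much harder."""
    text = re.sub(r"\{\{.*?\}\}", " ", text, flags=re.S)           # templates
    text = re.sub(r"\[\[(?:[^|\]]*\|)?([^\]]*)\]\]", r"\1", text)  # [[links|label]]
    text = re.sub(r"\[https?://\S+\s*([^\]]*)\]", r"\1", text)     # external links
    text = re.sub(r"<[^>]+>", " ", text)                           # html tags
    text = re.sub(r"'{2,}", "", text)                              # bold/italic quotes
    return re.findall(r"[A-Za-z']+", text.lower())

def edit_features(old_text, new_text):
    """A handful of numeric features a neural network could be trained on."""
    old_words, new_words = strip_wikitext(old_text), strip_wikitext(new_text)
    added = [w for w in new_words if w not in old_words]
    urls = len(re.findall(r"https?://", new_text))
    return [
        len(new_text) - len(old_text),  # size change
        len(added),                     # how many new words were introduced
        urls,                           # added external links are a classic spam sign
        sum(1 for c in new_text if c.isupper()) / max(len(new_text), 1),  # shouting
    ]

print(edit_features("Some article text.",
                    "Some article text. BUY CHEAP PILLS http://spam.example/"))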
Note that before training any intelligent system, be it Bayesian, a neural network, or something else, you need a good corpus of good and bad edits to train with...
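For instance (the file name and format here are purely hypothetical), the corpus could just be one labelled edit per line, with the spam lines collected from edits that were reverted or deleted as spam and the ham lines from edits that survived review:

# Hypothetical corpus layout: tab-separated label and edit text, one per line.
def load_corpus(path="edit_corpus.tsv"):
    spam, ham = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            label, _, text = line.rstrip("\n").partition("\t")
            (spam if label == "spam" else ham).append(text)
    return spam, ham

# spam_edits, ham_edits = load_corpus()
# ...and only then train the Bayesian / neural network / whatever model on them.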