On Thu, 16 Aug 2012 16:50:27 -0700, Tim Starling <tstarling@wikimedia.org> wrote:
On 17/08/12 04:16, Daniel Friesen wrote:
Of course. While I have the whole idea for the UI, the backend stuff, how to handle the service, etc., I haven't done the actual machine-learning stuff before.
I would think that the actual machine learning stuff would be the hard part. I stopped using Thunderbird's Bayesian spam tagging feature years ago, when it started sorting emails from smart people in with the spam. The computer thought that the smart people were using long words with a similar frequency to the random dictionary words that padded out the spam messages.
I haven't worked with machine learning either, but I'm guessing it's not as simple as feeding a pre-tagged data set into a stock Bayesian filter library.
-- Tim Starling
Yeah, Bayesian is probably too old to use. ClueBot NG appears to be using an Artificial Neural Network [ANN] implementation to do its spam testing. From the documentation [ClueBot NG] it sounds like one of the trickier parts is understanding the WikiText well enough to extract the words and whatnot out of it.
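To give a rough idea of what "extracting the words" out of WikiText might involve, here is a minimal sketch in Python. It is not how ClueBot NG does it (I haven't checked their preprocessing); the function name extract_words and the set of markup rules it handles are my own assumptions, and it ignores nested templates, tables and plenty of other syntax.

```python
import re

def extract_words(wikitext):
    """Strip some common wikitext markup and return lowercase word tokens.

    Hypothetical helper for illustration only; real wikitext needs a proper parser.
    """
    text = wikitext
    # Drop HTML comments and <ref>...</ref> footnotes entirely.
    text = re.sub(r"<!--.*?-->", " ", text, flags=re.DOTALL)
    text = re.sub(r"<ref\b[^>/]*>.*?</ref>", " ", text,
                  flags=re.DOTALL | re.IGNORECASE)
    # Drop any remaining HTML-style tags, keeping the text between them.
    text = re.sub(r"<[^>]+>", " ", text)
    # Drop non-nested templates such as {{cite web|...}}.
    text = re.sub(r"\{\{[^{}]*\}\}", " ", text)
    # [[Target|label]] -> label, [[Target]] -> Target.
    text = re.sub(r"\[\[(?:[^\[\]|]*\|)?([^\[\]|]*)\]\]", r"\1", text)
    # [http://example.org label] -> label.
    text = re.sub(r"\[https?://\S+\s*([^\]]*)\]", r"\1", text)
    # Drop bare URLs that are left over.
    text = re.sub(r"https?://\S+", " ", text)
    # Remove bold/italic quote runs and heading markers.
    text = re.sub(r"'{2,}", "", text)
    text = re.sub(r"^=+|=+\s*$", " ", text, flags=re.MULTILINE)
    # Whatever is left is treated as plain prose and split into words.
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


if __name__ == "__main__":
    sample = "== History ==\n'''Spam''' is [[Email spam|unwanted mail]]."
    print(extract_words(sample))
```

The resulting word list is the sort of thing you would then turn into features for whatever classifier ends up being used. Even this toy version needs an ordered pile of special cases, which is probably why the documentation makes that step sound tricky.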
[ANN] https://en.wikipedia.org/wiki/Artificial_neural_network
[ClueBot NG] https://en.wikipedia.org/wiki/User:ClueBot_NG