On 14/04/13 15:41, anubhav agarwal wrote:
I don't think we could take rollbacks into account for automated learning. It is not necessarily the case that whoever rolled the edit back did so because it was spam.
Getting the right data to train from is hard, since a wiki is so flexible. The good point of rollback is that a) it's easy to detect, b) it's restricted (a random user can't use it), and c) on some wikis policy restricts its use to “clearly bad edits”.
So you _should_ be training with "unwanted edits". But there will be false positives.
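To make that concrete, here is a rough Python sketch (not part of any existing extension; all names are illustrative) of how rollback-reverted edits could be harvested as "unwanted edit" training examples through the MediaWiki API. It assumes a wiki recent enough to mark rollback edits with the mw-rollback change tag; on wikis without that tag you would have to match the automatic rollback edit summary instead.

    # Rough sketch: harvest "unwanted edit" training examples from rollbacks.
    import requests

    API = "https://en.wikipedia.org/w/api.php"  # any MediaWiki API endpoint

    def rolled_back_revisions(limit=50):
        """Yield (title, old_revid) pairs for edits undone by a rollback."""
        params = {
            "action": "query",
            "list": "recentchanges",
            "rctag": "mw-rollback",       # only changes performed via rollback
            "rcprop": "title|ids",
            "rclimit": limit,
            "format": "json",
        }
        data = requests.get(API, params=params).json()
        for rc in data["query"]["recentchanges"]:
            # revid is the rollback itself; old_revid is the last bad revision.
            yield rc["title"], rc["old_revid"]

    def revision_text(revid):
        """Fetch the wikitext of a single revision."""
        params = {
            "action": "query",
            "prop": "revisions",
            "revids": revid,
            "rvprop": "content",
            "rvslots": "main",
            "format": "json",
        }
        data = requests.get(API, params=params).json()
        page = next(iter(data["query"]["pages"].values()))
        return page["revisions"][0]["slots"]["main"]["*"]

    # Each fetched text becomes a (probably) unwanted example; expect false
    # positives, since not every rollback reverts spam.
    unwanted = [revision_text(revid) for _, revid in rolled_back_revisions(10)]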
Though a "Train as spam" checkbox is a good idea. I was thinking about a "report spam" button next to the "edit" button in the top-right corner of a section.
However, that only tells you that "somewhere in the page there is spam", not what the spam is (the last revision? an edit from two months ago?), nor does it encourage fixing it.
I was thinking of creating a job queue for big websites like Wikipedia: each edit would go into a queue, be processed offline, and later be rolled back to the original content if it triggers the alarm.
I'm not a big fan of this. You would have edit conflicts to handle, and it looks messy to have reverts performed by an extension. I recommend you work on the Bayesian detection of spam first, and leave the potential refactoring to make it run through the job queue for later.
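For the Bayesian part, here is a minimal, self-contained sketch of the idea: a naive Bayes classifier over word counts with Laplace smoothing. A real extension would of course be written in PHP and hook into the save pipeline; this only illustrates the classification step, and the training texts would come from sources like the rollbacks above or deleted-page archives.

    import math
    import re
    from collections import Counter

    def tokenize(text):
        return re.findall(r"[a-z0-9']+", text.lower())

    class NaiveBayes:
        def __init__(self):
            self.word_counts = {"spam": Counter(), "ham": Counter()}
            self.doc_counts = {"spam": 0, "ham": 0}

        def train(self, text, label):
            self.doc_counts[label] += 1
            self.word_counts[label].update(tokenize(text))

        def score(self, text, label):
            # log P(label) + sum over words of log P(word | label)
            total_docs = sum(self.doc_counts.values())
            prior = math.log(self.doc_counts[label] / total_docs)
            counts = self.word_counts[label]
            total_words = sum(counts.values())
            vocab = len(set(self.word_counts["spam"]) | set(self.word_counts["ham"]))
            likelihood = sum(
                math.log((counts[w] + 1) / (total_words + vocab))  # Laplace smoothing
                for w in tokenize(text)
            )
            return prior + likelihood

        def is_spam(self, text):
            return self.score(text, "spam") > self.score(text, "ham")

    nb = NaiveBayes()
    nb.train("buy cheap pills online, click here for free offers", "spam")
    nb.train("the treaty was signed in 1648, ending the thirty years war", "ham")
    print(nb.is_spam("click here for cheap pills"))   # True with this toy data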
I think I could look through the archives of deleted pages on the WM-ES wiki to get spam data for you.