[resent, wrong address for wikipedia-l] [FUP: wikipedia-l, CC: pywikipediabot-users]
Hello,
I have recently been thinking again about how well my Bayesian spam filter, implemented by Spambayes[1], works at filtering my e-mail. For an explanation of Bayesian spam filtering, see the Spambayes homepage. I was wondering whether something similar could be done for Newpages. It could reduce human work and might prove a very interesting experiment as well.
The bot I am thinking of would follow Newpages live. It fetches each page and checks it against its database:

- If it is classified as ham, continue.
- If it is classified as unsure, ask the user whether it is {{delete}}-material: if yes, train as spam and prepend {{delete}} to the article; if no, train as ham.
- If it is classified as spam, show the user (part of) the content to confirm that it really is spam; if not, treat it as unsure-ham.
- If the page already contains '{{delete}}', train as spam and continue.

In each case it could add a comment to the article or a message to the talk page: <!-- classified by ... as ... with score ... -->. When no user is running the program, it would build a stack of articles to work through once a user starts the program again.
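In Python, the loop might look roughly like this. classify(), train(), prepend_template() and the ask_user callback are placeholders for whatever the Pywikipediabot and Spambayes APIs actually provide, not real calls:

    # Sketch of the Newpages patrol loop. classify(), train(),
    # prepend_template() and ask_user() are hypothetical placeholders
    # standing in for the real Pywikipediabot/Spambayes interfaces.

    HAM, SPAM, UNSURE = "ham", "spam", "unsure"

    def patrol(pages, ask_user=None):
        backlog = []                     # queued while nobody is at the keyboard
        for page in pages:
            text = page.get()            # fetch the article text
            if "{{delete}}" in text:
                train(text, is_spam=True)   # already tagged: free training data
                continue
            label, score = classify(text)
            if label == HAM:
                continue
            if ask_user is None:
                backlog.append(page)     # stack for the next interactive session
                continue
            # For both 'unsure' and 'spam', confirm with a human before acting.
            if ask_user(page, text, label, score):
                train(text, is_spam=True)
                prepend_template(page, "{{delete}}")
            else:
                train(text, is_spam=False)   # false alarm: reinforce as ham
        return backlog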
This would be implemented using an enhanced Pywikipediabot and the library that comes with Spambayes. I foresee some problems. For example, each user would have their own 'hammy.db'; since we are all working on the same thing, we would want a central hammy.db, probably one per language. This would live on a central server (it need not be Wikipedia's: I volunteer my own server for this task). Initially it would be a command-line tool, although a web interface might prove very useful as well.
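To give an idea of the central-database part: a client could report its training decisions over HTTP instead of writing a local hammy.db. The URL and JSON payload below are invented purely for illustration:

    import json
    from urllib.request import Request, urlopen

    # Hypothetical central training endpoint, one database per language.
    # The URL scheme and payload format are illustrative only.
    CENTRAL = "http://example.org/hammy/{lang}/train"

    def train_central(lang, text, is_spam):
        payload = json.dumps({"text": text, "is_spam": is_spam}).encode("utf-8")
        req = Request(CENTRAL.format(lang=lang), data=payload,
                      headers={"Content-Type": "application/json"})
        with urlopen(req) as resp:
            return resp.status == 200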
In addition to the contents of the page, clues could also come from the contributing user: whether the user is logged in or anonymous, the range of the IP, the name of the page, and, why not, the time of day, although the latter probably carries less weight than the former.
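Since Bayesian classifiers of the Spambayes kind work on token streams, these clues could simply be injected as synthetic tokens alongside the page text. The 'meta:' naming scheme below is my own invention:

    import ipaddress

    def tokens_for(page_title, text, username, is_anon, hour):
        """Yield the words of the page plus synthetic metadata tokens.

        The 'meta:' prefix is invented for illustration; any scheme
        works as long as the tokens cannot collide with real words.
        """
        for word in text.split():
            yield word
        yield "meta:title:" + page_title.lower()
        yield "meta:anon:" + ("yes" if is_anon else "no")
        if is_anon:
            # Collapse the IP to its /16 so the classifier can learn ranges.
            net = ipaddress.ip_network(username + "/16", strict=False)
            yield "meta:iprange:" + str(net)
        else:
            yield "meta:user:" + username.lower()
        yield "meta:hour:%02d" % hour    # probably the weakest clue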
Perhaps it could also be done for RecentChanges. The bot would then be fed the diffs. This would require a lot more work, because there is a major difference between removing a line and adding a line: in fact, where adding a token would be a spam hint with probability p, removing it would be a ham hint with clue 1 - p. This is much more difficult and I do not have the knowledge to write such a thing. It does not seem impossible, though.
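To make the inversion concrete, here is a toy sketch, where spamprob() is a hypothetical lookup of a token's learned spam probability:

    def diff_clues(added_lines, removed_lines, spamprob):
        """Turn a diff into (token, clue) pairs.

        spamprob(token) is a hypothetical lookup returning the token's
        learned spam probability p; a removed token inverts the clue to
        1 - p, since a removal is evidence in the opposite direction.
        """
        clues = []
        for line in added_lines:
            for tok in line.split():
                clues.append((tok, spamprob(tok)))
        for line in removed_lines:
            for tok in line.split():
                clues.append((tok, 1.0 - spamprob(tok)))
        return clues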
What do you think?
kind regards, Gerrit Holl.
I think this should be on Wikitech-l, and as such am CCing it there.
John Lee ([[en:User:Johnleemk]])