[Wikipedia-l] Bayesian filtering to find suspicious new pages and changes

Gerrit gerrit at nl.linux.org
Mon Nov 22 10:32:32 UTC 2004


[resent, wrong address for wikipedia-l]
[FUP: wikipedia-l, CC: pywikipediabot-users]

Hello,

I have recently been thinking again about how well my Bayesian spam
filter, implemented by Spambayes[1], works at filtering my e-mail.
For an explanation of Bayesian spam filtering, see the Spambayes homepage.
I was wondering whether it would be possible to do something similar
for Newpages. It could reduce human work and might prove a very
interesting experiment as well.

The bot I am thinking of would follow Newpages live. It fetches each
page and checks it against its database. If the page is classified as
ham, the bot continues with the next page. If it is classified as
unsure, the bot asks the user whether it is {{delete}} material: if yes,
train as spam and prepend {{delete}} to the article; if no, train as
ham. The bot could add a comment to the article or a message to the talk
page: <!-- classified by ... as ... with score ...  -->
If the page is classified as spam, the bot shows the user (part of) the
content to confirm that it really is spam (if not, handle it like an
unsure page that the user judged to be ham).
If the page already contains '{{delete}}', train as spam and continue.
When no user is running the program, the bot builds up a stack of
articles to work through once a user starts the program again.
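
The workflow above could be sketched roughly as follows. This is only an
illustration under stated assumptions: ToyClassifier is a tiny
naive-Bayes stand-in and not the Spambayes API, the two cutoffs are
made-up values, and triage() and ask_user are hypothetical names. Pages
scored as unsure and as spam both end up in front of the user, as
described above.

```python
import math
from collections import Counter

# Assumed thresholds for illustration; they are not from the proposal.
HAM_CUTOFF = 0.2
SPAM_CUTOFF = 0.9


class ToyClassifier:
    """Minimal naive-Bayes stand-in; NOT the Spambayes library."""

    def __init__(self):
        self.counts = {True: Counter(), False: Counter()}
        self.docs = {True: 0, False: 0}

    def train(self, tokens, is_spam):
        self.docs[is_spam] += 1
        self.counts[is_spam].update(set(tokens))

    def score(self, tokens):
        """Return a spam score in [0, 1]; 0.5 when nothing is known."""
        log_odds = 0.0
        for tok in set(tokens):
            # Laplace-smoothed per-token probabilities for each class.
            p_spam = (self.counts[True][tok] + 1) / (self.docs[True] + 2)
            p_ham = (self.counts[False][tok] + 1) / (self.docs[False] + 2)
            log_odds += math.log(p_spam / p_ham)
        log_odds = max(-30.0, min(30.0, log_odds))  # avoid overflow
        return 1.0 / (1.0 + math.exp(-log_odds))


def triage(text, clf, ask_user):
    """Handle one new page; return (label, score, possibly edited text)."""
    tokens = text.lower().split()
    if "{{delete}}" in text:
        # Already marked by a human: train as spam and move on.
        clf.train(tokens, True)
        return "spam", None, text
    score = clf.score(tokens)
    if score < HAM_CUTOFF:
        return "ham", score, text
    # Unsure or spam: show (part of) the page and let the user decide.
    verdict = "spam" if score >= SPAM_CUTOFF else "unsure"
    is_spam = ask_user(text[:200], verdict, score)
    clf.train(tokens, is_spam)
    if is_spam:
        return "spam", score, "{{delete}}\n" + text
    return "ham", score, text
```

Here ask_user would be whatever prompts the human (a command-line
question at first). Pages arriving while nobody is present could simply
be appended to a list and passed through triage() later.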

This would be implemented using an enhanced Pywikipediabot and the
library that comes with Spambayes. I foresee some problems. For example,
each user would have his or her own 'hammy.db'. As we are all working on
the same thing, we would want a central hammy.db, probably one per
language. This would live on a central server (it need not be
Wikipedia's: I volunteer my server for this task). Initially, it would
be a command-line tool, although a web interface might prove very useful
as well.

In addition to the contents of the page, clues can also come from the
contributing user, whether the user is logged in or anonymous, the IP
range, the name of the page, and, why not, the time of day, although the
last of these might be worth less than the others.
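
One way to feed such clues to the classifier would be to turn them into
extra tokens next to the words of the page. A minimal sketch, assuming
IPv4 addresses and a /16 as the "range"; the function name
metadata_tokens and the token prefixes are invented for illustration:

```python
def metadata_tokens(title, user=None, ip=None, hour_utc=None):
    """Turn page metadata into classifier tokens (hypothetical scheme)."""
    tokens = []
    # Logged in or anonymous.
    tokens.append("login:anon" if user is None else "login:registered")
    if user is not None:
        tokens.append("user:" + user)
    # IP range: keep only the first two octets of an IPv4 address.
    if ip is not None:
        octets = ip.split(".")
        if len(octets) == 4:
            tokens.append("iprange:" + ".".join(octets[:2]))
    # Name of the page, one token per word.
    for word in title.lower().split():
        tokens.append("title:" + word)
    # Time of day, by hour.
    if hour_utc is not None:
        tokens.append("hour:%02d" % hour_utc)
    return tokens
```

The prefixes keep a word in the title from being confused with the same
word in the page body, so the classifier can weigh them separately.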

Perhaps it could also be done for RecentChanges. The bot would then be
fed the diffs. This would require a lot more work, because there is a
major difference between removing a line and adding a line (in fact, if
one is a spam hint, the inverse would be a ham hint, with a clue value
of 1 minus the other's).
This is much more difficult and I do not have the knowledge to write
such a thing. It does not seem impossible, though.
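
As a rough starting point, added and removed lines could be given
separate tokens, so that the classifier learns a separate clue for each
direction. This is a simplification of the 1-minus idea above, since the
two clues are learned independently. The sketch uses difflib from the
Python standard library; diff_tokens and the "added:"/"removed:"
prefixes are invented names.

```python
import difflib


def diff_tokens(old_text, new_text):
    """Tokenize an edit, keeping added and removed words apart."""
    tokens = []
    diff = difflib.ndiff(old_text.splitlines(), new_text.splitlines())
    for line in diff:
        if line.startswith("+ "):
            prefix = "added:"
        elif line.startswith("- "):
            prefix = "removed:"
        else:
            continue  # skip unchanged lines and "? " hint lines
        tokens.extend(prefix + word for word in line[2:].lower().split())
    return tokens
```

With this scheme, adding a spammy word and removing the same word yield
different tokens, so vandalism and its revert would not be scored alike.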

What do you think?

kind regards,
Gerrit Holl.

[1] http://www.spambayes.org/

-- 
Weather in Lulea / Kallax, Sweden 22/11 09:50:
	-19.0°C   wind 0.9 m/s NW (34 m above NAP)
-- 
In the councils of government, we must guard against the acquisition of
unwarranted influence, whether sought or unsought, by the
military-industrial complex. The potential for the disastrous rise of
misplaced power exists and will persist.
    -Dwight David Eisenhower, January 17, 1961


