Having spent a good portion of my academic life in the field of Pattern Recognition, and having read the CRM114 articles in depth, my gut tells me that some of the other methods (throttling, etc.) would likely reduce Wikipedia vandalism sufficiently that a Markovian approach would give little additional benefit. Making vandalism easy to fix, such as undoing all edits from a particular user/IP with one switch, also sounds very useful.
I do like the idea of flagging suspect pages, and I also really liked the idea of flagging anonymous users at a higher suspect rate than logged-in users. I suspect that a "heuristic" (lots of rules) model similar to the one used by SpamAssassin, where you can plug in new rules for new threats, would probably be the best way to solve the vandalism problem for Wikipedia over time (although I don't have a really good idea from the discussion thus far as to whether it is already a serious issue or not).
For example, one heuristic would be "anonymous user": add a couple of points to the "spam" score. Another heuristic, "this user/IP has added lots of edits really quickly", would add a few points. If a user was in a "safe" list as a mass spell checker/grammar checker, then you would subtract a bunch of points. Heuristics for the type of article might be useful too; I suspect that political articles are particularly targeted by vandals, for example. The point is, a heuristic engine adapts over time, whereas a Markovian model would perhaps serve as ONE good heuristic within the larger engine. Markovian or Bayesian engines tend to get "tired" over time. My Bayesian email spam filter is now so tired of spam that it sees nearly everything as spam, whether it is or isn't (of course, it was trained on nearly 300,000 spams before it reached that point).
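To make the idea concrete, here is a rough sketch of such a scoring engine in Python. The rule names, weights, threshold, and Edit fields are all made up for illustration; nothing here is real MediaWiki code.

from dataclasses import dataclass

@dataclass
class Edit:
    user: str
    is_anonymous: bool
    recent_edit_count: int   # edits by this user/IP in, say, the last hour
    article_category: str

# Hypothetical "safe" list of trusted bulk editors (spell/grammar bots).
SAFE_LIST = {"SpellBot", "GrammarFixer"}

# Each rule is (description, predicate, score delta); handling a new threat
# just means plugging in another rule.
RULES = [
    ("anonymous user",      lambda e: e.is_anonymous,                    2.0),
    ("rapid-fire edits",    lambda e: e.recent_edit_count > 20,          3.0),
    ("trusted bulk editor", lambda e: e.user in SAFE_LIST,              -5.0),
    ("high-risk topic",     lambda e: e.article_category == "politics",  1.5),
]

def vandalism_score(edit: Edit) -> float:
    """Sum the deltas of every rule that fires; higher means more suspect."""
    return sum(delta for _name, test, delta in RULES if test(edit))

FLAG_THRESHOLD = 4.0  # purely illustrative cutoff

def is_suspect(edit: Edit) -> bool:
    return vandalism_score(edit) >= FLAG_THRESHOLD

So an anonymous user making 30 quick edits to a political article would score 6.5 and get flagged, while a logged-in bot on the safe list would not; tuning the weights and threshold is where the human judgment comes in.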
When a heuristic engine is coupled with human interaction and double-checking (as would be the natural case with Wikipedia), this becomes a great system. You would just have a special suspected-vandalism page that is generated much as the special statistics-type pages are generated now.
Knowing the seriousness of the problem would drive whether or not this feature should be developed quickly, and that is knowledge I don't possess at this point.
-Kelly
At 08:17 PM 3/14/2004, you wrote:
So, I just installed the CRM114 Markovian spam filtering software:
http://crm114.sourceforge.net/
The whole thing is based on Bayesian filtering, which is just a way to make very dumb software make really smart decisions. With sufficient training, a very simple piece of software can make very accurate distinctions between spam and non-spam email messages. See Paul Graham's famous "A Plan for Spam" about this:
http://www.paulgraham.com/spam.html
The CRM114 stuff is Markovian, which means it's _even_dumber_ than Bayesian stuff, and makes _even_smarter_ decisions. More or less.
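For anyone who hasn't read the Graham article, here's the gist boiled down to a toy Python sketch. This is not CRM114's actual (Markovian) algorithm, just the basic "count tokens per class, then compare" idea:

from collections import Counter

spam_counts, ham_counts = Counter(), Counter()
spam_total = ham_total = 0

def train(tokens, is_spam):
    """Count how often each token appears in spam vs. non-spam training text."""
    global spam_total, ham_total
    if is_spam:
        spam_counts.update(tokens)
        spam_total += 1
    else:
        ham_counts.update(tokens)
        ham_total += 1

def spam_probability(tokens):
    """Naively combine per-token frequencies; closer to 1.0 means more spam-like."""
    p_spam, p_ham = 1.0, 1.0
    for t in set(tokens):
        # Laplace-smoothed per-class token frequencies
        p_spam *= (spam_counts[t] + 1) / (spam_total + 2)
        p_ham *= (ham_counts[t] + 1) / (ham_total + 2)
    return p_spam / (p_spam + p_ham)

Feed it enough known-spam and known-good text and the scores separate surprisingly well, which is the whole point: dumb software, smart decisions.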
Anyways, one thing that's mentioned on the crm114 page is that folks use the same technology for different kinds of text sorting. Like, for system administrators, they can sort log file entries into ones they're interested in and ones they're not.
And I was thinking: you know, it'd be nice to be able to flag acceptable and problematic articles in MediaWiki Web sites. Like, say, an admin sees some vandalism going on, and goes to fix it. One of the checkmarks on saving is "Vandalism fix" or some such. This would tag the previous version as... ungood. Something.
And then after a while the software gets good at understanding what's ungood and what's not. And there could be a tracking page to say, "These seem to be pages in an ungood state." And it would be easier to find those and fix 'em.
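Roughly, the glue around that checkbox and the tracking page might look like the sketch below. The Revision fields, the tokenizer, and the reuse of the train/spam_probability functions from the earlier sketch are all assumptions for illustration, not anything MediaWiki provides:

from dataclasses import dataclass

@dataclass
class Revision:
    text: str           # the newly saved text
    previous_text: str  # the version it replaced

def tokenize(text):
    return text.lower().split()

def on_save(rev: Revision, marked_as_vandalism_fix: bool):
    """Learn from the 'Vandalism fix' checkbox on save."""
    if marked_as_vandalism_fix:
        # The version being replaced is the vandalized ("ungood") one.
        train(tokenize(rev.previous_text), is_spam=True)
    else:
        train(tokenize(rev.text), is_spam=False)

def suspected_vandalism_titles(pages, threshold=0.9):
    """The tracking page: titles whose current text scores as 'ungood'."""
    return [title for title, text in pages
            if spam_probability(tokenize(text)) > threshold]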
~ESP
--
Evan Prodromou <evan@wikitravel.org>
Wikitravel - http://www.wikitravel.org/
The free, complete, up-to-date and reliable world-wide travel guide
_______________________________________________
Wikitech-l mailing list
Wikitech-l@Wikipedia.org
http://mail.wikipedia.org/mailman/listinfo/wikitech-l