Having spent a good portion of my academic life in the field of Pattern
Recognition, and having read the CRM114 articles in depth, my gut tells me
that some of the other methods (throttling, etc.) would likely reduce
Wikipedia vandalism sufficiently that a Markovian approach would give
little additional benefit. Making vandalism easy to fix, such as by undoing
all edits from a particular user/IP with one switch, also sounds very useful.
I do like the idea of flagging suspect pages, and I also really like the
idea of flagging anonymous users at a higher suspect rate than logged-in
users. I suspect that a "heuristic" (lots of rules) model similar to that
used by SpamAssassin, where you can plug in new rules for new threats,
would probably be the best way to solve the vandalism problem for Wikipedia
over time (although I don't have a good idea from the discussion thus far
as to whether it is already a serious issue or not).
For example, one heuristic would be "anonymous user": add a couple of
points to the "spam" score. Another heuristic, "this user/IP has made lots
of edits really quickly", would add a few points. If a user was on a "safe"
list as a mass spell checker/grammar checker, then you would subtract a
bunch of points. Heuristics for the type of article might be useful; I
suspect that political articles are particularly targeted by vandals, for
example. The point is, a heuristic engine adapts over time, whereas a
Markovian model would perhaps serve as ONE good heuristic within the larger
engine. Markovian or Bayesian engines tend to get "tired" over time: my
Bayesian email spam filter is now so tired of spam that it sees nearly
everything as spam, whether it is or isn't (of course, it was trained on
nearly 300,000 spams before it reached that point).
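To make that concrete, here is a rough sketch of such a rule engine in
Python. Every rule name, weight, field, and threshold below is an invented
illustration, not an existing MediaWiki interface; a Markovian classifier
score could slot in as just another rule:

    # Sketch of a SpamAssassin-style heuristic engine for scoring an edit.
    # All field names, weights, and the threshold are assumptions made up
    # for illustration.
    SAFE_LIST = {"SpellBot", "GrammarGnome"}  # hypothetical trusted bulk editors

    RULES = [
        # (name, predicate over an edit record, points added when it fires)
        ("anonymous user",    lambda e: e["user_id"] is None,           2.0),
        ("rapid-fire edits",  lambda e: e["edits_last_hour"] > 20,      3.0),
        ("safe-listed user",  lambda e: e["user_name"] in SAFE_LIST,   -5.0),
        ("political article", lambda e: "politics" in e["categories"],  1.0),
    ]

    SUSPECT_THRESHOLD = 4.0

    def vandalism_score(edit):
        """Sum the points of every rule that fires on this edit."""
        return sum(points for _name, fires, points in RULES if fires(edit))

    def is_suspect(edit):
        return vandalism_score(edit) >= SUSPECT_THRESHOLD

Adapting to a new threat then just means appending another (name,
predicate, points) entry and retuning the weights, rather than retraining
a whole model.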
When a heuristic engine is coupled with human interaction and
double-checking (as would be the natural case with Wikipedia), this becomes
a great system. You would just have a special suspected-vandalism page that
is generated much as the special statistics-type pages are generated now.
Knowing the seriousness of the problem would drive whether or not this
feature should be developed quickly, and that is knowledge I don't possess
at this point.
-Kelly
At 08:17 PM 3/14/2004, you wrote:
So, I just installed the CRM114 Markovian spam
filtering software:
http://crm114.sourceforge.net/
The whole thing is based on Bayesian filtering, which is just a way to
make very dumb software make really smart decisions. With sufficient
training, a very simple piece of software can make very accurate
distinctions between spam and non-spam email messages. See Paul
Graham's famous "A Plan for Spam" about this:
http://www.paulgraham.com/spam.html
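The core trick is just word-frequency bookkeeping. Here's a toy sketch in
Python of that kind of scorer (a deliberate simplification, and not
CRM114's actual algorithm):

    # Toy Bayesian text scorer in the spirit of "A Plan for Spam".
    # A deliberate simplification: CRM114's Markovian variant weighs
    # word sequences rather than independent single words.
    import math
    from collections import Counter

    class BayesFilter:
        def __init__(self):
            self.good = Counter()  # word counts from legitimate text
            self.bad = Counter()   # word counts from spam/vandalism

        def train(self, text, is_bad):
            (self.bad if is_bad else self.good).update(text.lower().split())

        def score(self, text):
            """Log-odds that text is bad; positive means probably bad."""
            n_good = sum(self.good.values()) + 1
            n_bad = sum(self.bad.values()) + 1
            s = 0.0
            for word in text.lower().split():
                p_bad = (self.bad[word] + 1) / n_bad
                p_good = (self.good[word] + 1) / n_good
                s += math.log(p_bad / p_good)
            return s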
The CRM114 stuff is Markovian, which means it's _even_dumber_ than
Bayesian stuff, and makes _even_smarter_ decisions. More or less.
Anyways, one thing that's mentioned on the crm114 page is that folks
use the same technology for different kinds of text sorting. Like, for
system administrators, they can sort log file entries into ones
they're interested in and ones they're not.
And I was thinking: you know, it'd be nice to be able to flag
acceptable and problematic articles in MediaWiki Web sites. Like, say,
an admin sees some vandalism going on, and goes to fix it. One of the
checkboxes on saving is "Vandalism fix" or some such. This would tag
the previous version as... ungood. Something.
And then after a while the software gets good at understanding what's
ungood and what's not. And there could be a tracking page to say,
"These seem to be pages in an ungood state." And it would be easier to
find those and fix 'em.
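In code, that feedback loop might amount to something like the sketch
below; the page object, its methods, and the save hook are all made up for
the example, and it assumes a classifier like the BayesFilter sketched
above:

    # Sketch of the feedback loop: the "Vandalism fix" checkbox feeds the
    # reverted text to the classifier as a bad example, and a tracking
    # page lists articles whose current text scores as ungood.
    # The page object and its methods are hypothetical.
    vandal_filter = BayesFilter()

    def on_save(page, new_text, vandalism_fix_checked):
        if vandalism_fix_checked:
            # the revision being replaced was vandalism
            vandal_filter.train(page.current_text(), is_bad=True)
        vandal_filter.train(new_text, is_bad=False)  # the fix is good text
        page.save(new_text)

    def ungood_pages(all_pages):
        """Candidates for the 'pages in an ungood state' tracking page."""
        return [p for p in all_pages
                if vandal_filter.score(p.current_text()) > 0]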
~ESP
--
Evan Prodromou <evan@wikitravel.org>
Wikitravel -
http://www.wikitravel.org/
The free, complete, up-to-date and reliable world-wide travel guide