Cobi (owner of ClueBot) and his roommate Crispy have already been
working hard to build this specific dataset, but they've been held
back by a lack of contributors. The page is here:
wiki/User:Crispy1989#New_Dataset_Contribution_Interface
X!
On Mar 19, 2009, at 8:15 AM, Tei wrote:
On Thu, Mar 19, 2009 at 1:03 PM, Delirium <delirium(a)hackish.org> wrote:
Brian wrote:
This extension is very important for training machine learning
vandalism detection bots. Recently published systems use only hundreds
of examples of vandalism in training - not nearly enough to
distinguish between the variety found in Wikipedia or generalize to
new, unseen forms of vandalism. A large set of human-created rules
could be run against all previous edits in order to create a massive
vandalism dataset.
As a machine-learning person, this seems like a somewhat problematic
idea: generating training examples *from a rule set* and then learning
on them is just a very roundabout way of reconstructing that rule set.
What you really want is a large dataset of human-labeled examples of
vandalism / non-vandalism that *can't* currently be distinguished
reliably by rules, so you can throw a machine-learning algorithm at the
problem of trying to come up with some.
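To make the point about rule-generated labels concrete, here is a
minimal sketch. Everything in it is an assumption made up for
illustration: the word list, the three features, the toy edits, and the
one-feature "stump" learner are hypothetical and do not come from
ClueBot or any published system. It only shows that when the labels
come from a rule, the best the learner can do is hand that rule back.

```python
# Hypothetical rule and toy edits; none of this comes from an existing bot.
BAD_WORDS = {"poop", "sucks"}

def features(text):
    """Binary features a learner might see for one edit."""
    words = text.lower().split()
    return {
        "has_bad_word": any(w in BAD_WORDS for w in words),
        "all_caps": text.isupper(),
        "short": len(words) < 3,
    }

def rule_label(text):
    """The hand-written rule used to generate the 'training labels'."""
    return features(text)["has_bad_word"]

def fit_stump(examples):
    """Pick the single feature that agrees most often with the labels."""
    names = examples[0][0].keys()

    def agreement(name):
        return sum(feats[name] == label for feats, label in examples)

    return max(names, key=agreement)

edits = [
    "Added a citation for the 2008 census figures",
    "this article sucks",
    "FIXED TYPO",
    "poop",
    "Reverted",
    "The mayor sucks and so does this town hall",
]

# Labels are produced by the rule, not by a human.
training = [(features(e), rule_label(e)) for e in edits]
learned = fit_stump(training)
print(learned)  # has_bad_word
```

The learner scores perfectly on this data, but only because it has
rediscovered the rule that produced the labels. It learns nothing about
vandalism the rule misses, which is why human-labeled examples matter.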
Since there's already a database, it sounds like this could be done by
flagging edits as "vandalism", and then reading the existing database
information to extract the details, like the IP, a diff of the change,
etc. That way, humans define what "vandalism" is, and the machine can
learn the meaning.
This may need a button or something, so users can report it and the
database flags the edit.
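A rough sketch of how that flag-and-extract idea could look, assuming a
made-up `Revision` record. The field names, `flag()`, and
`training_example()` are hypothetical and are not the MediaWiki schema
or API; only `difflib.unified_diff` is a real standard-library call.

```python
import difflib
from dataclasses import dataclass

@dataclass
class Revision:
    # Hypothetical record; field names are illustrative, not the MediaWiki schema.
    rev_id: int
    user_ip: str
    old_text: str
    new_text: str
    flagged_vandalism: bool = False

def flag(revision):
    """What a 'report vandalism' button might do: set the flag on the edit."""
    revision.flagged_vandalism = True

def training_example(revision):
    """Turn a stored revision into a human-labeled example with its diff."""
    diff = list(difflib.unified_diff(
        revision.old_text.splitlines(),
        revision.new_text.splitlines(),
        lineterm="",
    ))
    # Skip the two file-header lines, then keep only added/removed lines.
    changes = [line for line in diff[2:] if line.startswith(("+", "-"))]
    return {
        "rev_id": revision.rev_id,
        "ip": revision.user_ip,
        "diff": changes,
        "label": revision.flagged_vandalism,
    }

rev = Revision(
    rev_id=1,
    user_ip="192.0.2.7",  # address from the reserved documentation range
    old_text="Paris is the capital of France.",
    new_text="Paris is the capital of poop.",
)
flag(rev)  # a human decides this edit is vandalism
example = training_example(rev)
print(example["diff"])
```

The label here comes from a person pressing the button, not from a
rule, so the resulting examples can include vandalism that no existing
rule catches.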
--
ℱin del ℳensaje.
_______________________________________________
Wikitech-l mailing list
Wikitech-l(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l