I just wanted to be really clear about what I mean by a specific
counter-example to this just being a matter of "reconstructing that
rule set." Suppose you run the AbuseFilter rules over the entire
history of the wiki in order to generate a dataset of positive and
negative examples of vandalism edits. You should then *throw the rules
away* and attempt to discover features that separate the edits into
the correct classes, more or less blind.
The key, then, is feature discovery, and a machine system has the
potential to do this more effectively than a human by virtue of its
ability to read the entire encyclopedia.
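A minimal sketch of that pipeline, under the assumptions above (the
rules, the toy edit "history", and the feature set are all invented
for illustration, and the threshold split is just a placeholder for a
real learner):

```python
def rule_label(edit: str) -> int:
    """Stand-in for AbuseFilter-style rules; discarded after labeling."""
    return 1 if edit == "" or edit.isupper() else 0

def features(edit: str) -> list:
    """Features deliberately independent of the rules themselves."""
    words = edit.split()
    return [
        len(edit),                                          # edit length
        sum(c.isalpha() for c in edit) / (len(edit) or 1),  # letter ratio
        len(words),                                         # word count
    ]

# Tiny invented edit history.
history = [
    "Fixed a typo in the second paragraph.",
    "BUY CHEAP STUFF NOW",
    "",
    "Added a citation to the 2007 paper.",
]

# Step 1: the rules generate labels. Step 2: the rules are thrown away;
# only (features, label) pairs remain for the learner.
dataset = [(features(e), rule_label(e)) for e in history]

# Step 3: a one-feature threshold split stands in for a real learner,
# trained only on the feature vectors, never on the rules.
vandal_lengths = [f[0] for f, y in dataset if y == 1]
clean_lengths = [f[0] for f, y in dataset if y == 0]
threshold = (max(vandal_lengths) + min(clean_lengths)) / 2

def predict(edit: str) -> int:
    return 1 if features(edit)[0] < threshold else 0
```

The point of the exercise is that predict() now reasons from the
features rather than re-encoding the rules, which is what lets it fire
on edits the original rules never matched.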
On Thu, Mar 19, 2009 at 2:30 PM, Brian <Brian.Mingus(a)colorado.edu> wrote:
I presented a talk at Wikimania 2007 that espoused the virtues of
combining human measures of content with automatically determined
measures in order to generalize to unseen instances. Unfortunately,
all those Wikimania talks seem to have been lost. It was related to
this article on predicting the quality ratings provided by the
Wikipedia Editorial Team:
Rassbach, L., Pincock, T., Mingus, B. (2007). "Exploring the
Feasibility of Automatically Rating Online Article Quality"
http://upload.wikimedia.org/wikipedia/wikimania2007/d/d3/RassbachPincockMin…
Delirium, you do make it sound as if merely having the tagged dataset
solves the entire problem. But there are really multiple problems. One
is learning to classify what you have been told is in the dataset
(e.g., that all instances of this rule in the edit history *really
are* vandalism). Another is learning new reasons that an edit is
vandalism, based on all the other occurrences of vandalism and
non-vandalism and a sophisticated pre-parse of all the content that
breaks it down into natural language features. Finally, you then wish
to use this system to bootstrap a vandalism detection system that can
generalize to entirely new instances of vandalism.
The primary way of doing this is to use positive and *negative*
examples of vandalism in conjunction with their features. A good
example of such features is an article's or an edit's conformance with
the Wikipedia Manual of Style. I never implemented the entire MoS, but
I did implement quite a bit of it, and it is quite indicative of
quality.
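A hypothetical sketch of what such style-conformance features might
look like (these checks are invented simplifications for illustration,
not the actual MoS checks described above):

```python
import re

def mos_features(text: str) -> dict:
    """Invented, simplified stand-ins for Manual of Style checks."""
    sentences = [s for s in re.split(r"[.!?]+\s*", text) if s]
    return {
        # Shouting / excessive capitalization is discouraged.
        "all_caps_words": sum(
            w.isupper() and len(w) > 1 for w in text.split()
        ),
        # Sentences should start with a capital letter.
        "lowercase_sentence_starts": sum(
            s[0].islower() for s in sentences
        ),
        # Runs of repeated punctuation ("!!!") are a vandalism tell.
        "repeated_punct_runs": len(re.findall(r"[!?]{2,}", text)),
    }
```

Each edit's feature dictionary, together with its vandalism /
non-vandalism label, would then be handed to the learner.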
Generally speaking, it is not true that you can only draw conclusions
about what is immediately available in your dataset. It is true,
however, that machine learning systems, unlike people, struggle with
generalization.
On Thu, Mar 19, 2009 at 6:03 AM, Delirium <delirium(a)hackish.org> wrote:
Brian wrote:
This extension is very important for training machine learning
vandalism detection bots. Recently published systems use only hundreds
of examples of vandalism in training - not nearly enough to
distinguish between the variety found in Wikipedia or generalize to
new, unseen forms of vandalism. A large set of human-created rules
could be run against all previous edits in order to create a massive
vandalism dataset.
As a machine-learning person, this seems like a somewhat problematic
idea--- generating training examples *from a rule set* and then
learning on them is just a very roundabout way of reconstructing that
rule set.
What you really want is a large dataset of human-labeled examples of
vandalism / non-vandalism that *can't* currently be distinguished
reliably by rules, so you can throw a machine-learning algorithm at the
problem of trying to come up with some.
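Concretely, that selection step might be sketched like this (the
function names, the toy rule, and the example edits are all
hypothetical):

```python
# Keep only human-labeled edits that the current rule set gets wrong,
# so the learner focuses on distinctions the rules *can't* make.
def hard_examples(labeled_edits, rules_predict):
    """labeled_edits: (edit_text, human_label) pairs, label 1 = vandalism.
    rules_predict: the existing rule set, as a callable returning 0/1."""
    return [(e, y) for e, y in labeled_edits if rules_predict(e) != y]

# Toy data: the rules catch shouting but miss subtler vandalism.
labeled = [
    ("SPAM SPAM SPAM", 1),
    ("Fixed a broken reference.", 0),
    ("He was born in 1875 [actually 1857].", 1),
]
rules = lambda e: 1 if e.isupper() else 0
hard = hard_examples(labeled, rules)
```

Only the third edit survives the filter, and it is exactly the kind of
example worth spending a learning algorithm on.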
-Mark
_______________________________________________
Wikitech-l mailing list
Wikitech-l(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l