Brian wrote:
Delerium, you do make it sound as if merely having the tagged dataset solves the entire problem. But there are really multiple problems. One is learning to classify what you have been told is in the dataset (e.g., that all instances of this rule in the edit history *really are* vandalism). The other is learning new reasons that an edit is vandalism, based on all the other occurrences of vandalism and non-vandalism and a sophisticated pre-parse of all the content that breaks it down into natural language features. Finally, you then wish to use this system to bootstrap a vandalism detection system that can generalize to entirely new instances of vandalism.
Generally speaking, it is not true that you can only draw conclusions about what is immediately available in your dataset. It is true that machine learning systems, unlike people, struggle with generalization.
My point is mainly that using the *results* of an automated rule system as *input* to a machine-learning algorithm won't constitute training on "vandalism", but on "what the current rule set considers vandalism". I don't see a particularly good reason to find new reasons an edit is vandalism for edits that we already correctly predict. What we want is new discriminators for edits we *don't* correctly predict. And for those, you can't use the labels given by the current rules as the training data: if the current rule set produces false positives, those become positives in your training set; and if it has false negatives, those become negatives in your training set.
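To make that concrete, here is a minimal sketch with made-up edit features and a deliberately imperfect rule set. The features, the "majority vote" stand-in for a learner, and the specific mistakes are all invented for illustration; the point is only that a learner trained on rule-produced labels reproduces the rules' errors rather than correcting them.

```python
from collections import defaultdict

# Hypothetical edits: (feature, whether it is *truly* vandalism)
edits = [
    ("curse_word", True), ("curse_word", True),
    ("blanked_page", True), ("blanked_page", False),  # one legitimate blanking
    ("typo_fix", False), ("typo_fix", False),
    ("subtle_hoax", True),                            # vandalism the rules miss
]

def rule_label(feature):
    """Imperfect current rule set: flags curses and blanking, misses hoaxes."""
    return feature in ("curse_word", "blanked_page")

# "Train" on the rule set's output: majority vote of rule labels per feature
# (a trivial stand-in for any learner fit to these labels).
votes = defaultdict(list)
for feature, _truth in edits:
    votes[feature].append(rule_label(feature))
model = {f: sum(v) > len(v) / 2 for f, v in votes.items()}

# The learned model agrees with the rules everywhere, mistakes included:
for feature, _truth in edits:
    assert model[feature] == rule_label(feature)

print(model["blanked_page"])  # True: the rules' false positive persists
print(model["subtle_hoax"])   # False: the rules' false negative persists
```

The model's error rate against the *true* labels is exactly the rule set's error rate; nothing in the training signal lets it do better.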
I suppose it could be used for proposing hypotheses to human discriminators. For example, you can propose new feature X if you find that 95% of the time the existing rule set flags edits with feature X as vandalism, and by human inspection determine that the remaining 5% were false negatives, so feature X really should be a new "this is vandalism" feature. But you need that human inspection: you can't automatically discriminate between rules that improve the filter set's performance and rules that degrade it if your labeled data set is the one with the mistakes in it.
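The proposal workflow above can be sketched as follows. The counts, the 95% threshold, and the feature itself are hypothetical; the only automated step is measuring agreement with the existing rules, and the disagreeing remainder goes to a human rather than being auto-labeled.

```python
# Hypothetical edits: (has_feature_X, flagged_by_current_rule_set)
edits = (
    [(True, True)] * 19      # 19 edits with X that the rules already flag
    + [(True, False)]        # 1 edit with X the rules do NOT flag
    + [(False, False)] * 30  # edits without X, for scale
)

# How often do the current rules flag edits carrying feature X?
with_x = [flagged for has_x, flagged in edits if has_x]
agreement = sum(with_x) / len(with_x)

THRESHOLD = 0.95
if agreement >= THRESHOLD:
    # Propose X, but send the disagreeing edits to humans instead of
    # trusting the rule labels: these may be the rules' false negatives.
    review_queue = [e for e in edits if e[0] and not e[1]]
    print(f"propose feature X ({agreement:.0%} agreement); "
          f"{len(review_queue)} edit(s) queued for human inspection")
```

Only after a human confirms the queued edits were in fact vandalism would X be promoted to a rule; the 5% can't be resolved by the mislabeled dataset itself.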
-Mark