I presented a talk at Wikimania 2007 that espoused the virtues of combining human measures of content with automatically determined measures in order to generalize to unseen instances. Unfortunately, all of those Wikimania talks seem to have been lost. It was related to this article on predicting the quality ratings provided by the Wikipedia Editorial Team:
Rassbach, L., Pincock, T., & Mingus, B. (2007). "Exploring the Feasibility of Automatically Rating Online Article Quality". http://upload.wikimedia.org/wikipedia/wikimania2007/d/d3/RassbachPincockMing...
Delirium, you do make it sound as if merely having the tagged dataset solves the entire problem. But there are really multiple problems. One is learning to classify what you have been told is in the dataset (e.g., that all instances matching this rule in the edit history *really are* vandalism). Another is learning new reasons that an edit is vandalism, based on all the other occurrences of vandalism and non-vandalism and a sophisticated pre-parse of all the content that breaks it down into natural language features. Finally, you then wish to use this system to bootstrap a vandalism detection system that can generalize to entirely new instances of vandalism.
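As a rough illustration of how those pieces fit together, here is a minimal sketch, assuming scikit-learn and simple bag-of-words features standing in for the sophisticated pre-parse; all of the names are made up for illustration:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression

    def train_and_probe(labeled_edits, unseen_vandalism):
        """labeled_edits: [(diff_text, is_vandalism)] derived from the rule set.
        unseen_vandalism: diff texts of vandalism the rules do NOT match."""
        texts, labels = zip(*labeled_edits)

        # First problem: learn to reproduce what the labels already say.
        vec = TfidfVectorizer(ngram_range=(1, 2), min_df=2)
        clf = LogisticRegression(max_iter=1000).fit(vec.fit_transform(texts), labels)

        # Third problem: probe generalization on vandalism the rules never saw.
        caught = clf.predict(vec.transform(unseen_vandalism))
        return clf, sum(caught) / len(caught)

If that last number stays near zero, the model has merely memorized the rule set -- which is the concern Mark raises below.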
The primary way of doing this is to use positive and *negative* examples of vandalism in conjunction with their features. A good set of example features is an article's or an edit's conformance with the Wikipedia Manual of Style. I never implemented the entire MoS, but I did implement quite a bit of it, and conformance is quite indicative of quality.
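Concretely, a few such MoS-conformance signals might look something like this sketch (the features and thresholds here are invented purely for illustration):

    import re

    def mos_features(wikitext):
        """A handful of illustrative Manual-of-Style conformance signals;
        a real feature set would be much larger."""
        words = wikitext.split()
        n = max(len(words), 1)
        return {
            # Shouting and character flooding rarely survive review.
            "caps_ratio": sum(w.isupper() and len(w) > 3 for w in words) / n,
            "char_floods": len(re.findall(r"(.)\1{4,}", wikitext)),
            # Well-formed articles tend to carry MoS-style structure.
            "has_headings": bool(re.search(r"^==[^=].*==\s*$", wikitext, re.M)),
            "ref_count": wikitext.count("<ref"),
            "exclamations": wikitext.count("!"),
        }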
Generally speaking, it is not true that you can only draw conclusions about what is immediately available in your dataset. It is true that machine learning systems, unlike people, struggle with generalization.
On Thu, Mar 19, 2009 at 6:03 AM, Delirium delirium@hackish.org wrote:
Brian wrote:
This extension is very important for training machine learning vandalism detection bots. Recently published systems use only hundreds of examples of vandalism in training - not nearly enough to cover the variety of vandalism found on Wikipedia or to generalize to new, unseen forms. A large set of human-created rules could be run against all previous edits in order to create a massive vandalism dataset.
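For concreteness, that rule-to-dataset sweep might look something like the following sketch; the diff interface, the iteration over history, and the rules themselves are all stand-ins:

    import re

    # Each rule is a (name, predicate) pair over a revision delta.
    # The `diff.added` / `diff.removed` interface is a stand-in.
    RULES = [
        ("blanking",  lambda diff: len(diff.added) == 0 and len(diff.removed) > 500),
        ("profanity", lambda diff: re.search(r"\b(?:badword1|badword2)\b", diff.added, re.I)),
    ]

    def build_dataset(edit_history):
        dataset = []
        for diff in edit_history:  # every past revision delta
            fired = [name for name, rule in RULES if rule(diff)]
            # Rule hits become positive labels; everything else is
            # (noisily) assumed benign -- the weak-supervision caveat.
            dataset.append((diff, bool(fired), fired))
        return dataset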
As a machine-learning person, this seems like a somewhat problematic idea: generating training examples *from a rule set* and then learning on them is just a very roundabout way of reconstructing that rule set. What you really want is a large dataset of human-labeled examples of vandalism / non-vandalism that *can't* currently be distinguished reliably by rules, so you can throw a machine-learning algorithm at the problem of trying to come up with some.
-Mark
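A minimal sketch of the selection Mark describes, assuming per-edit vandalism probabilities from whatever model currently exists and the same (name, predicate) rule stand-ins as in the earlier sketch:

    def select_for_annotation(edits, rules, probs, band=(0.35, 0.65), k=500):
        """Pick edits that no rule matches and the current model finds
        ambiguous -- candidates for human labeling. `edits` are revision
        diffs, `probs` their predicted vandalism probabilities, and
        `rules` a list of (name, predicate) pairs as above."""
        hard = [e for e, p in zip(edits, probs)
                if band[0] <= p <= band[1]
                and not any(rule(e) for _, rule in rules)]
        return hard[:k]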