I presented a talk at Wikimania 2007 that espoused the virtues of combining human measures of content with automatically determined measures in order to generalize to unseen instances. Unfortunately, all of those Wikimania talks seem to have been lost. It was related to this article on predicting the quality ratings provided by the Wikipedia Editorial Team:
Rassbach, L., Pincock, T., & Mingus, B. (2007). "Exploring the Feasibility of Automatically Rating Online Article Quality". http://upload.wikimedia.org/wikipedia/wikimania2007/d/d3/RassbachPincockMing...
Delirium, you do make it sound as if merely having the tagged dataset solves the entire problem. But there are really multiple problems. One is learning to classify what you have been told is in the dataset (e.g., that all instances matching this rule in the edit history *really are* vandalism). Another is learning new reasons that an edit is vandalism, based on all the other occurrences of vandalism and non-vandalism and a sophisticated pre-parse of all the content that breaks it down into natural language features. Finally, you then wish to use this system to bootstrap a vandalism detection system that can generalize to entirely new instances of vandalism.
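As a rough illustration of how those pieces fit together, here is a minimal sketch, assuming scikit-learn and simple bag-of-words features standing in for the sophisticated pre-parse; all of the names are made up for illustration:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression

    def train_and_probe(labeled_edits, unseen_vandalism):
        """labeled_edits: [(diff_text, is_vandalism)] derived from the rule set.
        unseen_vandalism: diff texts of vandalism the rules do NOT match."""
        texts, labels = zip(*labeled_edits)

        # First problem: learn to reproduce what the labels already say.
        vec = TfidfVectorizer(ngram_range=(1, 2), min_df=2)
        clf = LogisticRegression(max_iter=1000).fit(vec.fit_transform(texts), labels)

        # Third problem: probe generalization on vandalism the rules never saw.
        caught = clf.predict(vec.transform(unseen_vandalism))
        return clf, sum(caught) / len(caught)

If that last number stays near zero, the model has merely memorized the rule set -- which is the concern Mark raises below.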
The primary way of doing this is to use positive and *negative* examples of vandalism in conjunction with their features. A good set of example features is an article's or an edit's conformance with the Wikipedia Manual of Style. I never implemented the entire MoS, but I did implement quite a bit of it, and conformance is quite indicative of quality.
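Concretely, a few such MoS-conformance signals might look something like this sketch (the features and thresholds here are invented purely for illustration):

    import re

    def mos_features(wikitext):
        """A handful of illustrative Manual-of-Style conformance signals;
        a real feature set would be much larger."""
        words = wikitext.split()
        n = max(len(words), 1)
        return {
            # Shouting and character flooding rarely survive review.
            "caps_ratio": sum(w.isupper() and len(w) > 3 for w in words) / n,
            "char_floods": len(re.findall(r"(.)\1{4,}", wikitext)),
            # Well-formed articles tend to carry MoS-style structure.
            "has_headings": bool(re.search(r"^==[^=].*==\s*$", wikitext, re.M)),
            "ref_count": wikitext.count("<ref"),
            "exclamations": wikitext.count("!"),
        }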
Generally speaking, it is not true that you can only draw conclusions about what is immediately available in your dataset. It is true that machine learning systems, unlike people, struggle with generalization.
On Thu, Mar 19, 2009 at 6:03 AM, Delirium delirium@hackish.org wrote:
Brian wrote:
This extension is very important for training machine learning vandalism detection bots. Recently published systems use only hundreds of examples of vandalism in training - not nearly enough to cover the variety of vandalism found on Wikipedia or to generalize to new, unseen forms. A large set of human-created rules could be run against all previous edits in order to create a massive vandalism dataset.
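For concreteness, that rule-to-dataset sweep might look something like the following sketch; the diff interface, the iteration over history, and the rules themselves are all stand-ins:

    import re

    # Each rule is a (name, predicate) pair over a revision delta.
    # The `diff.added` / `diff.removed` interface is a stand-in.
    RULES = [
        ("blanking",  lambda diff: len(diff.added) == 0 and len(diff.removed) > 500),
        ("profanity", lambda diff: re.search(r"\b(?:badword1|badword2)\b", diff.added, re.I)),
    ]

    def build_dataset(edit_history):
        dataset = []
        for diff in edit_history:  # every past revision delta
            fired = [name for name, rule in RULES if rule(diff)]
            # Rule hits become positive labels; everything else is
            # (noisily) assumed benign -- the weak-supervision caveat.
            dataset.append((diff, bool(fired), fired))
        return dataset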
As a machine-learning person, this seems like a somewhat problematic idea: generating training examples *from a rule set* and then learning on them is just a very roundabout way of reconstructing that rule set. What you really want is a large dataset of human-labeled examples of vandalism / non-vandalism that *can't* currently be distinguished reliably by rules, so you can throw a machine-learning algorithm at the problem of trying to come up with some.
-Mark
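A minimal sketch of the selection Mark describes, assuming per-edit vandalism probabilities from whatever model currently exists and the same (name, predicate) rule stand-ins as in the earlier sketch:

    def select_for_annotation(edits, rules, probs, band=(0.35, 0.65), k=500):
        """Pick edits that no rule matches and the current model finds
        ambiguous -- candidates for human labeling. `edits` are revision
        diffs, `probs` their predicted vandalism probabilities, and
        `rules` a list of (name, predicate) pairs as above."""
        hard = [e for e, p in zip(edits, probs)
                if band[0] <= p <= band[1]
                and not any(rule(e) for _, rule in rules)]
        return hard[:k]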