I just wanted to be really clear about what I mean by a specific
counter-example to this just being a matter of "reconstructing that
rule set." Suppose you run the AbuseFilter rules over the entire
history of the wiki in order to generate a dataset of positive and
negative examples of vandalism edits. You should then *throw the rules
away* and attempt to discover features that separate the edits into
the correct classes, more or less blind.
The key, then, is feature discovery, and a machine system has the
potential to do this more effectively than a human by virtue of its
ability to read the entire encyclopedia.
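A minimal sketch of that pipeline, under the assumptions above (the
rules, the toy edit "history", and the feature set are all invented
for illustration, and the threshold split is just a placeholder for a
real learner):

```python
def rule_label(edit: str) -> int:
    """Stand-in for AbuseFilter-style rules; discarded after labeling."""
    return 1 if edit == "" or edit.isupper() else 0

def features(edit: str) -> list:
    """Features deliberately independent of the rules themselves."""
    words = edit.split()
    return [
        len(edit),                                          # edit length
        sum(c.isalpha() for c in edit) / (len(edit) or 1),  # letter ratio
        len(words),                                         # word count
    ]

# Tiny invented edit history.
history = [
    "Fixed a typo in the second paragraph.",
    "BUY CHEAP STUFF NOW",
    "",
    "Added a citation to the 2007 paper.",
]

# Step 1: the rules generate labels. Step 2: the rules are thrown away;
# only (features, label) pairs remain for the learner.
dataset = [(features(e), rule_label(e)) for e in history]

# Step 3: a one-feature threshold split stands in for a real learner,
# trained only on the feature vectors, never on the rules.
vandal_lengths = [f[0] for f, y in dataset if y == 1]
clean_lengths = [f[0] for f, y in dataset if y == 0]
threshold = (max(vandal_lengths) + min(clean_lengths)) / 2

def predict(edit: str) -> int:
    return 1 if features(edit)[0] < threshold else 0
```

The point of the exercise is that predict() now reasons from the
features rather than re-encoding the rules, which is what lets it fire
on edits the original rules never matched.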
On Thu, Mar 19, 2009 at 2:30 PM, Brian <Brian.Mingus(a)colorado.edu> wrote:
I presented a talk at Wikimania 2007 that espoused the virtues of
combining human measures of content with automatically determined
measures in order to generalize to unseen instances. Unfortunately,
all those Wikimania talks seem to have been lost. It was related to
this article on predicting the quality ratings provided by the
Wikipedia Editorial Team:
Rassbach, L., Pincock, T., Mingus, B. (2007). "Exploring the
Feasibility of Automatically Rating Online Article Quality"
http://upload.wikimedia.org/wikipedia/wikimania2007/d/d3/RassbachPincockMin…
Delirium, you do make it sound as if merely having the tagged dataset
solves the entire problem. But there are really multiple problems. One
is learning to classify what you have been told is in the dataset
(e.g., that all instances of this rule in the edit history *really
are* vandalism). Another is learning new reasons that an edit is
vandalism, based on all the other occurrences of vandalism and
non-vandalism and a sophisticated pre-parse of all the content that
breaks it down into natural language features. Finally, you then wish
to use this system to bootstrap a vandalism detection system that can
generalize to entirely new instances of vandalism.
The primary way of doing this is to use positive and *negative*
examples of vandalism in conjunction with their features. A good
example of such features is an article's or an edit's conformance with
the Wikipedia Manual of Style. I never implemented the entire MoS, but
I did implement quite a bit of it, and it is quite indicative of
quality.
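A hypothetical sketch of what such style-conformance features might
look like (these checks are invented simplifications for illustration,
not the actual MoS checks described above):

```python
import re

def mos_features(text: str) -> dict:
    """Invented, simplified stand-ins for Manual of Style checks."""
    sentences = [s for s in re.split(r"[.!?]+\s*", text) if s]
    return {
        # Shouting / excessive capitalization is discouraged.
        "all_caps_words": sum(
            w.isupper() and len(w) > 1 for w in text.split()
        ),
        # Sentences should start with a capital letter.
        "lowercase_sentence_starts": sum(
            s[0].islower() for s in sentences
        ),
        # Runs of repeated punctuation ("!!!") are a vandalism tell.
        "repeated_punct_runs": len(re.findall(r"[!?]{2,}", text)),
    }
```

Each edit's feature dictionary, together with its vandalism /
non-vandalism label, would then be handed to the learner.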
Generally speaking, it is not true that you can only draw conclusions
about what is immediately available in your dataset. It is true,
however, that machine learning systems, unlike people, struggle with
generalization.
On Thu, Mar 19, 2009 at 6:03 AM, Delirium <delirium(a)hackish.org> wrote:
Brian wrote:
This extension is very important for training machine learning
vandalism detection bots. Recently published systems use only hundreds
of examples of vandalism in training - not nearly enough to
distinguish between the variety found in Wikipedia or generalize to
new, unseen forms of vandalism. A large set of human-created rules
could be run against all previous edits in order to create a massive
vandalism dataset.
As a machine-learning person, this seems like a somewhat problematic
idea--- generating training examples *from a rule set* and then
learning on them is just a very roundabout way of reconstructing that
rule set.
What you really want is a large dataset of human-labeled examples of
vandalism / non-vandalism that *can't* currently be distinguished
reliably by rules, so you can throw a machine-learning algorithm at the
problem of trying to come up with some.
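Concretely, that selection step might be sketched like this (the
function names, the toy rule, and the example edits are all
hypothetical):

```python
# Keep only human-labeled edits that the current rule set gets wrong,
# so the learner focuses on distinctions the rules *can't* make.
def hard_examples(labeled_edits, rules_predict):
    """labeled_edits: (edit_text, human_label) pairs, label 1 = vandalism.
    rules_predict: the existing rule set, as a callable returning 0/1."""
    return [(e, y) for e, y in labeled_edits if rules_predict(e) != y]

# Toy data: the rules catch shouting but miss subtler vandalism.
labeled = [
    ("SPAM SPAM SPAM", 1),
    ("Fixed a broken reference.", 0),
    ("He was born in 1875 [actually 1857].", 1),
]
rules = lambda e: 1 if e.isupper() else 0
hard = hard_examples(labeled, rules)
```

Only the third edit survives the filter, and it is exactly the kind of
example worth spending a learning algorithm on.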
-Mark
_______________________________________________
Wikitech-l mailing list
Wikitech-l(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l