I've been looking at some recent work that used probabilistic context-free grammars (PCFGs)[1,2] to detect vandalism in Wikipedia, and I wanted to send a quick message to share some progress.
I've built a Python library that implements a simple PCFG training and scoring strategy, and I've written a quick demo of how it can work. In the demo below, I build a probabilistic grammar from the "I'm a Little Teapot" song[4]. Note how sentences that are not characteristic of the song score lower, and that scores are log-scaled.
>>> sentences = [
...     "I am a little teapot",
...     "Here is my handle",
...     "Here is my spout",
...     "When I get all steamed up I just shout tip me over and pour me out",
...     "I am a very special pot",
...     "It is true",
...     "Here is an example of what I can do",
...     "I can turn my handle into a spout",
...     "Tip me over and pour me out"]
>>> teapot_grammar = TreeScorer.from_tree_bank(bllip_parse(s) for s in sentences)
>>> teapot_grammar.score(bllip_parse("Here is a little teapot"))
-9.392661928770137
>>> teapot_grammar.score(bllip_parse("It is my handle"))
-10.296301543090733
>>> teapot_grammar.score(bllip_parse("I am a spout"))
-10.40166205874856
>>> teapot_grammar.score(bllip_parse("Your teapot is gay"))
-12.96352974967269
>>> teapot_grammar.score(bllip_parse("Your mom's teapot is asldasnldansldal"))
-19.424997926026403
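For those curious about what the training and scoring strategy amounts to: it's essentially maximum-likelihood estimation over productions -- count how often each rule expands each nonterminal in the treebank, then score a new parse tree by summing the log-probabilities of the productions it uses. Here's a minimal self-contained sketch of that idea using nested-tuple trees (this is not kasami's actual API; all names here are illustrative):

```python
import math
from collections import Counter

def productions(tree):
    """Yield (lhs, rhs) productions from a nested-tuple tree,
    e.g. ("S", ("NP", "I"), ("VP", ("V", "run")))."""
    if isinstance(tree, tuple):
        lhs, children = tree[0], tree[1:]
        rhs = tuple(c[0] if isinstance(c, tuple) else c for c in children)
        yield (lhs, rhs)
        for child in children:
            yield from productions(child)

def train(trees):
    """Estimate log P(rhs | lhs) by relative frequency over a treebank."""
    rule_counts = Counter()
    lhs_totals = Counter()
    for tree in trees:
        for lhs, rhs in productions(tree):
            rule_counts[(lhs, rhs)] += 1
            lhs_totals[lhs] += 1
    return {rule: math.log(count / lhs_totals[rule[0]])
            for rule, count in rule_counts.items()}

def score(grammar, tree, unseen=-20.0):
    """Sum log-probabilities of a tree's productions; unseen
    productions get a fixed penalty, so unfamiliar structure
    drags the total score down."""
    return sum(grammar.get(rule, unseen) for rule in productions(tree))
```

The fixed penalty for unseen productions is the crudest possible smoothing; it's just enough to show why out-of-corpus sentences score so much lower in the demo above.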
This work is inspired by work that Arthur Tilley did on our team last year[5]. The 'kasami' library[3] represents a narrow slice of Arthur's work.
Next, I'm working on building out revscoring to implement some features that apply this scoring strategy to the sentences modified in an edit. I'm hoping that this type of feature engineering will allow us to catch edits that make articles more or less notable. I'm also targeting spammy language and insults.
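To give a concrete sense of the kind of feature I have in mind (this is a hypothetical sketch, not revscoring's actual API -- the function and parameter names are mine): compare the grammar's scores for the sentences an edit removed against the sentences it added. A strongly negative delta would suggest the edit introduced text that is uncharacteristic of the corpus the grammar was trained on.

```python
def edit_score_delta(grammar, parse, removed_sentences, added_sentences):
    """Difference in mean log-score between added and removed sentences.
    `grammar` is any object with a score(tree) method and `parse` maps a
    sentence string to a tree -- both hypothetical stand-ins here."""
    def mean_score(sentences):
        if not sentences:
            return 0.0
        scores = [grammar.score(parse(s)) for s in sentences]
        return sum(scores) / len(scores)
    return mean_score(added_sentences) - mean_score(removed_sentences)
```
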
1. https://en.wikipedia.org/wiki/Stochastic_context-free_grammar
2. http://pub.cs.sunysb.edu/~rob/papers/acl11_vandal.pdf
3. https://github.com/halfak/kasami
4. https://en.wikipedia.org/wiki/I%27m_a_Little_Teapot
5. https://github.com/aetilley/pcfg
-Aaron