I've been looking at some recent work that used Probabilistic Context-free
Grammars[1,2] to detect vandalism in Wikipedia. I wanted to send a quick
message to share some progress.
I've built a Python library[3] that implements a really simple PCFG training
and scoring strategy and written a quick demo of how it can work. In the
following demo, I show how we can build a probabilistic grammar using the
I'm a Little Teapot song[4]. Note how sentences that are not
characteristic of the song score lower. Scores are log-scaled, so values
closer to zero indicate a better fit.
>> sentences = [
... "I am a little teapot",
... "Here is my handle",
... "Here is my spout",
... "When I get all steamed up I just shout tip me over and pour me out",
... "I am a very special pot",
... "It is true",
... "Here is an example of what I can do",
... "I can turn my handle into a spout",
... "Tip me over and pour me out"]
>>
>> teapot_grammar = TreeScorer.from_tree_bank(bllip_parse(s) for s in sentences)
>>
>> teapot_grammar.score(bllip_parse("Here is a little teapot"))
-9.392661928770137
>> teapot_grammar.score(bllip_parse("It is my handle"))
-10.296301543090733
>> teapot_grammar.score(bllip_parse("I am a spout"))
-10.40166205874856
>> teapot_grammar.score(bllip_parse("Your teapot is gay"))
-12.96352974967269
>> teapot_grammar.score(bllip_parse("Your mom's teapot is asldasnldansldal"))
-19.424997926026403
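For anyone who wants the gist without reading the library, here is a rough sketch of the general idea behind this kind of scoring: count the productions seen in the training trees, then score a new tree as the sum of the log-probabilities of its productions. This is not the kasami implementation. The SimpleTreeScorer class, the tuple-based tree format, and the crude floor for unseen productions are all assumptions made for illustration, and the toy trees are written by hand rather than produced by a parser.

```python
import math
from collections import Counter

# Hypothetical tree format: (label, [children]); a leaf word is a plain string.


def productions(tree):
    """Yield (lhs, rhs) for every internal node of the tree."""
    label, children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    yield label, rhs
    for child in children:
        if not isinstance(child, str):
            yield from productions(child)


class SimpleTreeScorer:
    """Illustrative stand-in, not the kasami TreeScorer."""

    def __init__(self, trees, unseen_weight=0.1):
        self.counts = Counter()      # (lhs, rhs) -> times seen
        self.lhs_totals = Counter()  # lhs -> times seen
        self.unseen_weight = unseen_weight
        for tree in trees:
            for lhs, rhs in productions(tree):
                self.counts[(lhs, rhs)] += 1
                self.lhs_totals[lhs] += 1

    def log_prob(self, lhs, rhs):
        # Assumption: a crude floor for unseen productions, not proper
        # smoothing. One pseudo-observation is added to every denominator.
        count = self.counts.get((lhs, rhs), self.unseen_weight)
        return math.log(count / (self.lhs_totals.get(lhs, 0) + 1))

    def score(self, tree):
        """Sum of production log-probabilities; closer to zero fits better."""
        return sum(self.log_prob(lhs, rhs) for lhs, rhs in productions(tree))


def toy_tree(noun):
    """Hand-built parse of 'I am a <noun>' for demonstration only."""
    return ("S", [
        ("NP", [("PRP", ["I"])]),
        ("VP", [("VBP", ["am"]),
                ("NP", [("DT", ["a"]), ("NN", [noun])])]),
    ])


scorer = SimpleTreeScorer([toy_tree("teapot"), toy_tree("pot")])
seen_score = scorer.score(toy_tree("teapot"))
unseen_score = scorer.score(toy_tree("asldasnldansldal"))
print(seen_score, unseen_score)
```

In this sketch the only difference between the two scored trees is the final word, so the gap between the scores comes entirely from the unseen NN production.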
This work is inspired by work that Arthur Tilley did on our team last
year[5]. The 'kasami' library represents a narrow slice of Arthur's work.
Next, I'm working on building out revscoring to implement some features
that use the scoring strategy on sentences modified in an edit. I'm hoping
that this type of feature engineering will allow us to catch edits that
make articles more/less notable. I'm also targeting spammy language and
insults.
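As a rough illustration of what such a feature might look like, the sketch below finds the sentences an edit adds or changes and summarizes their scores. None of this is revscoring's API. The function names, the naive regex sentence splitter, the choice of min and mean as summaries, and the stand-in scoring function are all assumptions; in practice the scorer would be something like the grammar scoring shown in the demo.

```python
import difflib
import re


def split_sentences(text):
    """Naive splitter on ., ! and ? (an assumption; real text needs more care)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def added_sentences(old_text, new_text):
    """Return sentences present in new_text that were inserted or changed."""
    old = split_sentences(old_text)
    new = split_sentences(new_text)
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    added = []
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        if tag in ("insert", "replace"):
            added.extend(new[j1:j2])
    return added


def sentence_score_features(old_text, new_text, score_sentence):
    """Summarize the scores of sentences touched by an edit.

    score_sentence is any callable mapping a sentence to a log-scaled score.
    """
    scores = [score_sentence(s) for s in added_sentences(old_text, new_text)]
    if not scores:
        return {"added_sentences": 0, "min_score": 0.0, "mean_score": 0.0}
    return {
        "added_sentences": len(scores),
        "min_score": min(scores),
        "mean_score": sum(scores) / len(scores),
    }


# Stand-in scorer for demonstration only: longer sentences score lower.
def fake_score(sentence):
    return -float(len(sentence.split()))


old_text = "Here is my handle. Here is my spout."
new_text = "Here is my handle. Your teapot is bad. Here is my spout."
features = sentence_score_features(old_text, new_text, fake_score)
print(features)
```

The minimum is included because a single very unlikely sentence is probably the more useful signal for vandalism, while the mean gives a sense of the edit as a whole.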
1. https://en.wikipedia.org/wiki/Stochastic_context-free_grammar
2. http://pub.cs.sunysb.edu/~rob/papers/acl11_vandal.pdf
3. https://github.com/halfak/kasami
4. https://en.wikipedia.org/wiki/I%27m_a_Little_Teapot
5. https://github.com/aetilley/pcfg
-Aaron