I've built a Python library that implements a very simple PCFG training and scoring strategy, and I've written a quick demo of how it works. In the demo below, I build a probabilistic grammar from the lyrics of the "I'm a Little Teapot" song. Note how sentences that are not characteristic of the song score lower, and that scores are log-scaled.
>>> sentences = [
... "I am a little teapot",
... "Here is my handle",
... "Here is my spout",
... "When I get all steamed up I just shout tip me over and pour me out",
... "I am a very special pot",
... "It is true",
... "Here is an example of what I can do",
... "I can turn my handle into a spout",
... "Tip me over and pour me out"]
>>> teapot_grammar = TreeScorer.from_tree_bank(bllip_parse(s) for s in sentences)
>>> teapot_grammar.score(bllip_parse("Here is a little teapot"))
>>> teapot_grammar.score(bllip_parse("It is my handle"))
>>> teapot_grammar.score(bllip_parse("I am a spout"))
>>> teapot_grammar.score(bllip_parse("Your teapot is gay"))
>>> teapot_grammar.score(bllip_parse("Your mom's teapot is asldasnldansldal"))
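To give a sense of what's going on under the hood, here's a minimal sketch of the general technique: count production rules from a set of parse trees, turn the counts into (smoothed) conditional probabilities, and score a new tree by summing the log-probabilities of its rules. This is an illustration of the idea, not kasami's actual implementation; `SimplePCFG`, the nested-tuple tree format, and the smoothing constant are all my own assumptions here.

```python
import math
from collections import Counter, defaultdict


def productions(tree):
    """Yield (lhs, rhs) production rules from a nested-tuple parse tree.

    Leaves are plain strings; internal nodes are (label, *children).
    """
    if isinstance(tree, str):
        return
    label, *children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    yield (label, rhs)
    for child in children:
        yield from productions(child)


class SimplePCFG:
    """Toy PCFG scorer: trains on trees, scores new trees in log space.

    Illustrative only -- not the kasami library's real API.
    """

    def __init__(self, trees, smoothing=0.01):
        # Count how often each (lhs -> rhs) rule appears in the tree bank.
        self.counts = defaultdict(Counter)
        for tree in trees:
            for lhs, rhs in productions(tree):
                self.counts[lhs][rhs] += 1
        self.smoothing = smoothing

    def log_prob(self, lhs, rhs):
        # Crude additive smoothing so unseen rules get a small,
        # heavily penalized probability instead of -inf.
        rules = self.counts[lhs]
        total = sum(rules.values()) + self.smoothing
        return math.log((rules[rhs] + self.smoothing) / total)

    def score(self, tree):
        # Log-scaled score: sum of rule log-probabilities over the tree.
        return sum(self.log_prob(lhs, rhs) for lhs, rhs in productions(tree))


seen = ("S", ("NP", "I"), ("VP", ("V", "run")))
unseen_verb = ("S", ("NP", "I"), ("VP", ("V", "fly")))

grammar = SimplePCFG([seen])
# A tree built entirely from familiar rules outscores one with an
# unseen production, mirroring the behavior in the demo above.
print(grammar.score(seen) > grammar.score(unseen_verb))
```

Because the scores are sums of log-probabilities, a single unfamiliar production drags the whole sentence's score down sharply, which is exactly the effect the teapot demo relies on.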
This work is inspired by work that Arthur Tilley did on our team last year. The 'kasami' library represents a narrow slice of Arthur's work.
Next, I'm working on building out revscoring to implement features that apply this scoring strategy to sentences modified in an edit. I'm hoping that this type of feature engineering will allow us to catch edits that make articles sound more or less notable. I'm also targeting spammy language and insults.