Date: Fri, 21 Dec 2007 10:34:47 -0800 From: "Luca de Alfaro" luca@dealfaro.org
If you want to pick out the malicious changes, you need to flag also small changes.
"Sen. Hillary Clinton did *not* vote in favor of war in Iraq"
"John Doe, born in *1947*"
The ** indicates changes.
Yes, and I did not mean to include cases such as these, which involve the insertion of a few words that could radically alter the semantic content of a unit of text. But legitimate spelling corrections (which can easily be identified by looking up the set of common misspellings for a word in any of the various spell-checker databases) do not. In short, I cannot imagine a case where someone changing "Senater Clinton" to "Senator Clinton" could involve vandalism (the "smoother" algorithm should of course also take into account that if a "misspelling" appears repeatedly in an article, or even better, in related subject articles by different authors, it is probably a valid technical term or a proper name).

I also cannot imagine how moving a large block of relatively self-contained text (i.e. a paragraph, since even parsing at the level of sentences is problematic given all the uses for the period '.') without modifying its interior could have any large semantic repercussions (readability is, of course, a matter for a different discussion ;-).
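To make the two suggestions above concrete, here is a rough sketch in Python. It is not the system's actual code: the COMMON_MISSPELLINGS table, the min_repeats threshold, and both function names are made up for illustration, and a real version would draw on a proper spell-checker database and would also look at related articles.

```python
from collections import Counter

# Hypothetical stand-in for a spell-checker database of common misspellings.
COMMON_MISSPELLINGS = {"senater": "senator", "recieve": "receive"}


def is_benign_spelling_fix(old_word, new_word, article_words, min_repeats=3):
    """Return True if old_word -> new_word looks like a plain spelling fix."""
    old, new = old_word.lower(), new_word.lower()
    # Only a known misspelling replaced by its known correction qualifies.
    if COMMON_MISSPELLINGS.get(old) != new:
        return False
    # A "misspelling" that recurs in the article may really be a technical
    # term or a proper name, so do not treat the change as benign.
    counts = Counter(w.lower() for w in article_words)
    return counts[old] < min_repeats


def is_pure_block_move(old_paragraphs, new_paragraphs):
    """Return True if paragraphs were only reordered, with interiors intact."""
    return (old_paragraphs != new_paragraphs
            and Counter(old_paragraphs) == Counter(new_paragraphs))
```

With this sketch, changing "Senater" to "Senator" in an article that contains "Senater" only once would be suppressed, while changing "1947" to "1948" would still be flagged, since no spelling table maps one year to another.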
Again, these are mainly quibbles, but for the articles I sampled it was quite annoying to have my eye repeatedly drawn to a single orange word that represented nothing more than a minor, good-faith correction. And overall the system seems to work well!