Gregory Maxwell wrote:
On 3/22/06, Ilmari Karonen <nospam@vyznev.net> wrote:
One could ignore edits that have been reverted. Detecting reverts, in the strict sense of the word, is easy: all you need is a hash value for each revision.
Of course, this wouldn't be perfect. But it'd be as close to perfect as any automated system can be. And it _would_ skip most vandals.
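To make the hash-based revert detection above concrete, here is a minimal sketch. It assumes the revision history is available as a plain list of texts, oldest first; the function name find_reverted and the choice of SHA-256 are illustrative assumptions, not anything taken from an existing tool.

```python
import hashlib


def find_reverted(revisions):
    """Return indices of revisions that were undone by a later exact revert.

    `revisions` is a list of revision texts, oldest first (an assumed
    input format). A revision counts as reverted if some later revision
    restores, byte for byte, a state that existed before it.
    """
    last_seen = {}    # hash of text -> index of latest revision with that text
    reverted = set()
    for i, text in enumerate(revisions):
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in last_seen:
            # Revision i restores an earlier state, so every revision
            # strictly between the two identical states was undone.
            reverted.update(range(last_seen[digest] + 1, i))
        last_seen[digest] = i
    return reverted
```

For a history like ["a", "a b", "VANDAL", "a b", "a b c"], this flags only index 2. It only catches exact reverts; a revert that also changes a single character goes unnoticed.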
And what happens if the next edit merges some content back in from the reverted text?
This case falls under "not perfect but as close as can be". It's essentially the same problem as someone pasting content from another article, or from another source entirely. Even your diff-based scheme, while nifty indeed, doesn't solve that. In general, nothing can.
By the way, it might be possible to optimize your scheme by using some form of histogram analysis to quickly establish lower bounds on edit distances. For that matter, if you're not using it already, even just the difference in article lengths gives a weak lower bound on the edit distance. Meanwhile, hashing can be used to establish upper bounds: hashing the entire text detects exact reversions (an edit distance of zero), and hashing deterministically chosen chunks (such as article sections) detects local changes, since the distance is then roughly bounded by the size of the chunks whose hashes differ.
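A rough sketch of the bounds mentioned above, assuming plain Levenshtein distance (insert, delete, substitute) over characters. Everything here is an illustrative assumption: the function names, the use of character counts as the "histogram", and the rule that a chunk starts at each line beginning with "==". The upper bound trusts matching hashes, so it ignores hash collisions.

```python
import hashlib
from collections import Counter


def length_lower_bound(a, b):
    """Weak lower bound: each insert or delete changes the length by 1."""
    return abs(len(a) - len(b))


def histogram_lower_bound(a, b):
    """Lower bound from character counts.

    One edit removes at most one surplus character and supplies at most
    one missing character, so at least max(surplus, missing) edits are
    needed.
    """
    ca, cb = Counter(a), Counter(b)
    surplus = sum((ca - cb).values())   # characters a has too many of
    missing = sum((cb - ca).values())   # characters a has too few of
    return max(surplus, missing)


def split_sections(text):
    """Split into chunks starting at '==' heading lines (assumed rule).

    The chunks concatenate back to the original text exactly.
    """
    chunks, current = [], []
    for line in text.splitlines(keepends=True):
        if line.startswith("==") and current:
            chunks.append("".join(current))
            current = []
        current.append(line)
    if current:
        chunks.append("".join(current))
    return chunks


def chunk_hash_upper_bound(a, b):
    """Upper bound: skip chunks whose hashes match at the start and end.

    Whatever lies between the shared leading and trailing chunks can be
    rewritten with at most max(len(middle_a), len(middle_b)) edits.
    """
    ca, cb = split_sections(a), split_sections(b)
    ha = [hashlib.sha256(c.encode("utf-8")).digest() for c in ca]
    hb = [hashlib.sha256(c.encode("utf-8")).digest() for c in cb]
    limit = min(len(ha), len(hb))
    k = 0                               # shared leading chunks
    while k < limit and ha[k] == hb[k]:
        k += 1
    m = 0                               # shared trailing chunks
    while m < limit - k and ha[-1 - m] == hb[-1 - m]:
        m += 1
    mid_a = sum(len(c) for c in ca[k:len(ca) - m])
    mid_b = sum(len(c) for c in cb[k:len(cb) - m])
    return max(mid_a, mid_b)
```

When a lower bound already exceeds some threshold, or the upper bound is zero, the full diff can be skipped. The histogram bound is never weaker than the length bound, since the difference between surplus and missing counts is exactly the length difference.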