There have been several attempts to make bots that detect copyright violations. The problem is that a lot of such "infringements" are legal, quotations for example, and then the writers get pissed because they have used the material in a completely legal way.
I have made a JavaScript-based solution that seems to solve the problem by keeping a user in the loop. The only thing the script does is search the web for possibly similar texts; the user decides whether a match is actually a problem.
Basically, the script takes the added text, extracts the plain text, excludes some of it, breaks it into sentences, uses the sentences to build a query, rematches the results against the sentences, accumulates the matches, and gives a warning if a match limit is reached.
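To make the steps concrete, here is a rough sketch of such a pipeline. It is not the actual script: every function name is made up for illustration, the "exclude" step is assumed to drop text inside double quotes, and the search itself is assumed to be a caller-supplied function that returns objects with `url` and `snippet` fields.

```javascript
// Illustrative sketch only; all names here are hypothetical.

// Extract plain text by dropping HTML-like tags (very rough).
function extractPlainText(markup) {
  return markup.replace(/<[^>]*>/g, " ").replace(/\s+/g, " ").trim();
}

// Assumption: excluded text is anything inside double quotes,
// since quotations are a typical legal use.
function excludeQuoted(text) {
  return text.replace(/"[^"]*"/g, " ").replace(/\s+/g, " ").trim();
}

// Break the text into sentences at '.', '!' or '?'.
function splitSentences(text) {
  const parts = text.match(/[^.!?]+[.!?]*/g) || [];
  return parts.map((s) => s.trim()).filter((s) => s.length > 0);
}

// Build one query out of the sentences, each as an exact phrase.
function buildQuery(sentences) {
  return sentences.map((s) => '"' + s.replace(/"/g, "") + '"').join(" OR ");
}

// Lowercase and reduce everything except letters and digits to single spaces.
function normalize(s) {
  return s.toLowerCase().replace(/[^\p{L}\p{N}]+/gu, " ").trim();
}

// Rematch each result against the sentences and accumulate hits per URL.
function rematch(sentences, results) {
  const hits = new Map();
  for (const r of results) {
    const snippet = " " + normalize(r.snippet) + " ";
    let count = 0;
    for (const s of sentences) {
      const n = normalize(s);
      // Pad with spaces so only whole-word sequences match.
      if (n.length > 0 && snippet.includes(" " + n + " ")) count++;
    }
    if (count > 0) hits.set(r.url, (hits.get(r.url) || 0) + count);
  }
  return hits;
}

// Report only the sources that reach the match limit.
function warnings(hits, limit) {
  return [...hits]
    .filter(([, count]) => count >= limit)
    .map(([url, count]) => ({ url: url, matches: count }));
}

// The whole chain; `search` is supplied by the caller.
function checkAddedText(markup, search, limit) {
  const sentences = splitSentences(excludeQuoted(extractPlainText(markup)));
  if (sentences.length === 0) return [];
  return warnings(rematch(sentences, search(buildQuery(sentences))), limit);
}
```

The result is only a list of candidate sources with a match count, so the decision stays with the user.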
At the moment I am trying to extend the system to older edits, and also to make it a bit more resistant to small changes in the text. It is already fairly resistant to small reorganizations of the text.
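One common way to tolerate small changes, which may or may not be what ends up in the script, is to compare overlapping word sequences ("shingles") instead of whole sentences. The sketch below is only an assumption about how that could look; the function names and the default shingle size of 3 are made up.

```javascript
// Illustrative sketch only; names and defaults are hypothetical.

// Collect all runs of k consecutive words as a set of "shingles".
function wordShingles(text, k) {
  const words = text.toLowerCase().match(/[\p{L}\p{N}]+/gu) || [];
  const shingles = new Set();
  for (let i = 0; i + k <= words.length; i++) {
    shingles.add(words.slice(i, i + k).join(" "));
  }
  return shingles;
}

// Jaccard similarity: shared shingles divided by all distinct shingles.
function jaccard(a, b) {
  if (a.size === 0 && b.size === 0) return 0;
  let shared = 0;
  for (const s of a) {
    if (b.has(s)) shared++;
  }
  return shared / (a.size + b.size - shared);
}

// Score between 0 (nothing in common) and 1 (same shingle set).
function shingleSimilarity(x, y, k = 3) {
  return jaccard(wordShingles(x, k), wordShingles(y, k));
}
```

A single replaced word only removes the shingles that contain it, so the score drops gradually instead of going straight to zero the way an exact sentence match would.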
John