There have been several attempts to make bots that detect copyright
violations. The problem is that a lot of such "infringements" are
legal, quotations for example, and then the writers get pissed
because they have used the material in a completely legal way.
I have made a JavaScript-based solution that seems to solve the problem
by placing a user in the loop. The only thing the script does is mine
the web for possibly similar texts; the user decides what to do with
them.
Basically the script takes the added text, extracts the plain text,
excludes some of the text, breaks it into sentences, uses the sentences
to build a query, rematches the results against the sentences,
accumulates the matches and gives a warning if a match limit is reached.
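To make the steps concrete, here is a rough sketch of that pipeline. It
is not the actual script: every function name is made up for
illustration, the rule for what gets excluded (double-quoted passages)
and the default limits are assumptions, and the web search itself is
left out, so the candidate pages are simply passed in as already
fetched text.

```javascript
// Sketch only: all names, the exclusion rule and the limits are assumptions.

// Reduce HTML-like markup to plain text (very rough).
function toPlainText(markup) {
  return markup.replace(/<[^>]*>/g, " ").replace(/\s+/g, " ").trim();
}

// Exclude some of the text; here (an assumption) double-quoted passages,
// since quotations are usually legal.
function excludeQuotations(text) {
  return text.replace(/"[^"]*"/g, " ");
}

// Break the text into sentences and drop very short ones.
function splitSentences(text, minWords = 4) {
  const parts = text.match(/[^.!?]+[.!?]*/g) || [];
  return parts
    .map((s) => s.trim())
    .filter((s) => s.split(/\s+/).length >= minWords);
}

// Lowercase, strip punctuation and collapse whitespace before comparing.
function normalize(text) {
  return text
    .toLowerCase()
    .replace(/[^\p{L}\p{N}\s]/gu, "")
    .replace(/\s+/g, " ")
    .trim();
}

// Build an exact-phrase query string for one sentence.
function buildQuery(sentence) {
  return '"' + normalize(sentence) + '"';
}

// Rematch the sentences against each candidate page, accumulate the
// matches, and warn about pages that reach the match limit.
function findSuspects(addedMarkup, candidates, matchLimit = 2) {
  const plain = excludeQuotations(toPlainText(addedMarkup));
  const sentences = splitSentences(plain);
  const warnings = [];
  for (const candidate of candidates) {
    const haystack = normalize(candidate.text);
    const matched = sentences.filter((s) => haystack.includes(normalize(s)));
    if (matched.length >= matchLimit) {
      warnings.push({ url: candidate.url, matched });
    }
  }
  return warnings;
}

// Usage: in a real run the candidates would come from search results
// for the queries produced by buildQuery.
const added =
  "<p>The quick brown fox jumps over the lazy dog. " +
  'He said "this is a quoted passage here". ' +
  "Pack my box with five dozen liquor jugs. Short one.</p>";
const candidates = [
  {
    url: "http://example.org/a",
    text: "Intro. The quick brown fox jumps over the lazy dog! " +
      "And also: pack my box with five dozen liquor jugs.",
  },
  { url: "http://example.org/b", text: "The quick brown fox jumps over the lazy dog." },
];
const warnings = findSuspects(added, candidates);
console.log(warnings.map((w) => w.url));
```

The script only collects the warnings; the user still has to look at
each flagged page and decide whether it is a real violation or a legal
use.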
At the moment I am trying to extend the system to older edits, and also
to make it a bit more resistant to small changes in the text. It is
already fairly resistant to small reorganizations of the text.
John