On Monday 20 June 2005 23:11, Mark Williamson wrote:
> ...and it would also flag every single page in Wikipedia, because they can also be found in absoluteastronomy, etc.
It is possible to do the Google search with "-wikipedia", which removes most of the mirrors from the results. The script could also filter mirrors automatically, but nevertheless you are right that it is far easier to consider new pages only.
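For example, the query for one phrase could simply be built like this (only a rough Python sketch, the function name and the example phrase are arbitrary):

    def build_query(phrase):
        # Exact phrase in quotes, plus "-wikipedia" to drop pages that
        # mention Wikipedia (most mirrors credit us somewhere on the page).
        return '"%s" -wikipedia' % phrase

    print(build_query("the quick brown fox jumps over"))
    # -> "the quick brown fox jumps over" -wikipedia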
Concerning the number of words: I found that in most cases a run of 5-6 words is unique (of course there are exceptions). But if one website contains the same combination of 5-6 words three times, you can be sure that this is not by chance. A more detailed analysis is still needed, of course, e.g. to handle public domain resources such as the Brockhaus 1911.
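Roughly, the check I have in mind would look like this (again only a sketch in Python; search() is just a placeholder for whatever wraps the search engine, and the phrase length and threshold are simply the values from above):

    import re

    PHRASE_LEN = 6   # words per phrase
    MIN_HITS = 3     # a site matching this many phrases is suspicious

    def phrases(text, length=PHRASE_LEN):
        # Split the article text into consecutive, non-overlapping runs
        # of `length` words; each run becomes one search phrase.
        words = re.findall(r"\w+", text)
        return [" ".join(words[i:i + length])
                for i in range(0, len(words) - length + 1, length)]

    def suspicious_sites(text, search):
        # `search` stands for the search engine wrapper: it takes a query
        # string and returns the domains of the matching pages.
        hits = {}
        for phrase in phrases(text):
            for domain in search('"%s" -wikipedia' % phrase):
                hits[domain] = hits.get(domain, 0) + 1
        # Only sites that match several independent phrases are reported,
        # which keeps single coincidental matches out of the list.
        return [d for d, n in hits.items() if n >= MIN_HITS]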
A completely automatic detection of copyright violations without too many false positives is a difficult problem. On the other hand, it might be sufficient to improve the tools for the editors.
best regards, Marco