[Wikipedia-l] Automatically checking for copyright violations
Marco Krohn
marco.krohn at web.de
Mon Jun 20 21:32:53 UTC 2005
On Monday 20 June 2005 23:11, Mark Williamson wrote:
> ...and it would also flag every single page in Wikipedia, because they
> can also be found in absoluteastronomy, etc.
It is possible to do the Google search with "-wikipedia", which removes most of
the mirrors from the results. The script could also filter mirrors
automatically, but nevertheless you are right that it is far easier to
consider new pages only.
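As a rough sketch of what such a query could look like (the helper name and
the exclusion list are illustrative, not from any existing tool):

```python
# Build a Google query for a distinctive phrase, excluding results
# that mention Wikipedia (the "-wikipedia" trick from above).
def build_query(phrase, excluded=("wikipedia",)):
    """Quote the phrase for an exact match and append term exclusions."""
    return '"%s" ' % phrase + " ".join("-%s" % term for term in excluded)

print(build_query("some distinctive phrase"))
# '"some distinctive phrase" -wikipedia'
```

One could also use Google's "-site:" operator to drop whole mirror domains
instead of excluding the word.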
Concerning the number of words: I found that in most cases a run of 5-6 words
is unique (of course there are exceptions). But if one website contains the
same combination of 5-6 words three times, you can be sure that this is not by
chance. Of course a more detailed analysis is still needed; e.g. there are
public domain resources such as the Brockhaus 1911, etc.
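The three-matching-runs heuristic above could be sketched like this (function
names and the threshold parameter are my own illustration, not part of any
actual script):

```python
import re

def shingles(text, n=6):
    """Return the set of n-word runs ("shingles") in the text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def looks_copied(article, candidate, n=6, threshold=3):
    """True if the candidate page shares at least `threshold` runs of
    n words with the article -- unlikely to happen by chance."""
    return len(shingles(article, n) & shingles(candidate, n)) >= threshold
```

In practice the candidate text would come from fetching the Google hits, and
public-domain sources would need to be whitelisted before flagging anything.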
Fully automatic detection of copyright violations without too many false
positives is a difficult problem. On the other hand, it might be sufficient to
improve the tools available to editors.
best regards,
Marco