[Wikipedia-l] Automatically checking for copyright violations

Marco Krohn marco.krohn at web.de
Mon Jun 20 21:03:07 UTC 2005


On Monday 20 June 2005 21:57, Angela wrote:
> The message below was sent to the Board today. Would implementing some
> sort of automatic copyvio checker be feasible?

I have done something similar for the German Wikipedia:

http://www.itp.uni-hannover.de/~krohn/wscan.html.utf8

It reads all new pages from the German Wikipedia, shows the beginning of each
text along with some statistics, and guesses which links to other articles
might be interesting. It also takes parts of some sentences and checks whether
they appear anywhere else on the Internet (by the way, a run of 5 to 6
consecutive words is almost unique).
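
To illustrate the phrase check described above, here is a minimal sketch. It
is not the actual script: the function name, the choice of taking the words
from the middle of a sentence, and the limit of three phrases per page are all
assumptions made for this example. Each phrase is wrapped in double quotes so
that a search engine would treat it as an exact-phrase query.

```python
import re

PUNCTUATION = '.,;:!?()[]"\''

def sample_phrases(text, words_per_phrase=6, max_phrases=3):
    """Return up to max_phrases quoted runs of consecutive words.

    Hypothetical helper: one run is taken from the middle of each
    sentence that is long enough.
    """
    # Naive sentence split: break after '.', '!' or '?' followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    phrases = []
    for sentence in sentences:
        # Drop punctuation attached to words so it does not end up in the query.
        words = [w.strip(PUNCTUATION) for w in sentence.split()]
        words = [w for w in words if w]
        if len(words) < words_per_phrase:
            continue  # too short to be distinctive
        start = (len(words) - words_per_phrase) // 2
        phrase = " ".join(words[start:start + words_per_phrase])
        phrases.append('"' + phrase + '"')
        if len(phrases) == max_phrases:
            break
    return phrases
```

Each returned string would then be sent to the search engine as one request,
and the number of results counted as hits for that page.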

Finally, the output is sorted by the number of hits ("Fundstellen"). I have
several ideas for improving the script further (e.g. whitelists), but right
now I do not have the time to do this.
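
The sorting step, together with one possible reading of the whitelist idea,
could be sketched as follows. This is an assumption, not the existing code:
the function name and the input format (a dict mapping page titles to the
URLs found for them) are invented for the example, and the whitelist is
interpreted here as a set of hostnames whose hits are ignored.

```python
from urllib.parse import urlparse

def rank_pages(results, whitelist=frozenset()):
    """Return (title, hit_count) pairs, pages with the most hits first.

    results:   hypothetical dict {page_title: [url, ...]}
    whitelist: hostnames whose hits should not count (an assumed design)
    """
    ranked = []
    for title, urls in results.items():
        # Count only hits whose hostname is not whitelisted.
        hits = [u for u in urls if urlparse(u).hostname not in whitelist]
        ranked.append((title, len(hits)))
    ranked.sort(key=lambda item: item[1], reverse=True)
    return ranked
```

Pages at the top of the resulting list are the ones most worth checking by
hand; a hit count alone does not prove a copyright violation.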

Nevertheless, if someone is interested I am glad to send them the GPLed source
code (Python), and I can certainly give some advice.

best regards,
  Marco

P.S. Google was kind enough to extend my Google key to 7000 requests per day
(the standard Google key only allows 1000 requests per day, which is not
sufficient).
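
To show why the request limit matters, here is a small back-of-the-envelope
calculation. The figure of three search requests per new page is only an
assumption for the example; the real number depends on how many phrases are
checked per page.

```python
def pages_per_day(daily_quota, queries_per_page):
    """Number of new pages that can be fully checked within the daily quota."""
    return daily_quota // queries_per_page

# Assuming 3 requests per page (hypothetical):
print(pages_per_day(1000, 3))  # 333 pages with the standard key
print(pages_per_day(7000, 3))  # 2333 pages with the extended key
```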


