[Wikipedia-l] Automatically checking for copyright violations
Marco Krohn
marco.krohn at web.de
Mon Jun 20 21:03:07 UTC 2005
On Monday 20 June 2005 21:57, Angela wrote:
> The message below was sent to the Board today. Would implementing some
> sort of automatic copyvio checker be feasible?
I have done something similar for the German Wikipedia:
http://www.itp.uni-hannover.de/~krohn/wscan.html.utf8
It reads all new pages from the German Wikipedia and shows the beginning of
the text and some statistics (and guesses which links to other articles might
be interesting). It also takes parts of some sentences and checks whether they
appear anywhere on the internet (by the way, 5 to 6 consecutive words are
almost unique).
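
The phrase-picking step described above could look roughly like the following. This is only a minimal sketch, not the actual script: the function names `candidate_phrases` and `as_exact_query`, the naive sentence splitting, and the choice of taking the words from the middle of each sentence are all assumptions made for illustration.

```python
import re

def candidate_phrases(text, words_per_phrase=6, max_phrases=3):
    """Pick a few runs of consecutive words to use as exact-phrase queries.

    Hypothetical helper: the real script may choose its phrases differently.
    """
    phrases = []
    # Naive sentence split: break after '.', '!' or '?' followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        words = re.findall(r"\w+", sentence)
        if len(words) < words_per_phrase:
            continue  # too short to be distinctive
        # Take the run from the middle of the sentence (an arbitrary choice).
        start = (len(words) - words_per_phrase) // 2
        phrases.append(" ".join(words[start:start + words_per_phrase]))
        if len(phrases) == max_phrases:
            break
    return phrases

def as_exact_query(phrase):
    """Wrap a phrase in double quotes so a search engine treats it as exact."""
    return '"' + phrase + '"'
```

Each returned phrase would then be sent to the search engine as a quoted query, and the number of results counted as hits for that page.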
Finally, the output is sorted by the number of hits ("Fundstellen"). I have
several ideas for improving the script further (e.g. whitelists), but right
now I do not have the time to do this.
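
Sorting by hits, combined with the whitelist idea, might be sketched as below. The whitelist is not implemented in the script as described, so everything here is an assumption: the names `WHITELIST`, `count_hits` and `rank_pages`, the example host in the whitelist, and the input format (a dict mapping page titles to the URLs where a phrase was found).

```python
from urllib.parse import urlparse

# Hypothetical whitelist: hosts whose matches should not count as hits.
WHITELIST = {"de.wikipedia.org"}

def count_hits(found_urls, whitelist=WHITELIST):
    """Count found URLs whose host is not on the whitelist."""
    return sum(1 for url in found_urls
               if urlparse(url).hostname not in whitelist)

def rank_pages(results, whitelist=WHITELIST):
    """Return (title, hits) pairs, most hits first.

    `results` maps a page title to the list of URLs where one of its
    phrases was found (an assumed format, for illustration only).
    """
    scored = [(title, count_hits(urls, whitelist))
              for title, urls in results.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)
```

Pages at the top of the resulting list would be the first candidates for a manual copyright check.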
Nevertheless, if someone is interested I am glad to send them the GPLed source
code (Python), and I can certainly give some advice.
best regards,
Marco
P.S. Google was kind enough to extend my Google key to 7000 requests per day
(the standard Google key only allows 1000 requests per day, which is not
sufficient).