[Wikipedia-l] Automatically checking for copyright violations

Puddl Duk puddlduk at gmail.com
Tue Jun 21 01:10:00 UTC 2005


On 6/20/05, Marco Krohn <marco.krohn at web.de> wrote:
> On Monday 20 June 2005 21:57, Angela wrote:
> > The message below was sent to the Board today. Would implementing some
> > sort of automatic copyvio checker be feasible?
> 
> I have done something similar for the German Wikipedia:
> 
> http://www.itp.uni-hannover.de/~krohn/wscan.html.utf8
> 
> it reads all newpages from German Wikipedia, shows the beginning of the text
> and some statistics (and guesses which links to other articles might be
> interesting). Also it takes parts of some sentences and checks whether they
> appear somewhere in the internet (btw 5 to 6 consecutive words are almost
> unique).
> 
> Finally the output is sorted by the number of hits ("Fundstellen"). I have
> several ideas how to improve the script further (e.g. whitelists), but right
> now I do not have the time to do this.
> 
> Nevertheless, if someone is interested I am glad to send them the GPLed
> source code (Python), or can certainly give some advice.
> 
> best regards,
>   Marco
> 
> P.S. google was so kind to extend my google key to 7000 requests per day (the
> standard google key only allows 1000 requests per day which is not
> sufficient)

I've written something similar; it's very rough, as I'm not a programmer.

It can usually find about 20 to 30 significant copyright violations
in a day's previous newpages on :en. It also gets a lot of false
positives; I haven't finished parsing out all the templates.

CDVF could use a plugin along these lines; it would make a neat
programming contest.
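Marco's phrase-sampling approach could be sketched roughly like this. This is a minimal outline under stated assumptions, not his actual script: `search_hits` is a hypothetical stand-in for whatever web search API is available (e.g. the Google key mentioned in the thread), and the sampling interval is there only to illustrate staying within a daily request quota.

```python
import re

def shingles(text, n=6):
    # Every run of n consecutive words; the thread notes that 5 to 6
    # consecutive words are almost unique on the web.
    words = re.findall(r"\w+", text)
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def copyvio_score(text, search_hits, n=6, sample_every=10):
    # search_hits is a placeholder for a real search backend: it takes
    # a quoted query string and returns the number of hits found.
    # Query only every sample_every-th shingle to conserve the daily
    # request quota, and sum the hit counts reported.
    phrases = shingles(text, n)[::sample_every]
    return sum(search_hits('"%s"' % p) for p in phrases)
```

New pages would then be sorted by this score in descending order, matching the "Fundstellen" ordering Marco describes, so likely copyvios surface at the top of the report.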


