On 6/20/05, Marco Krohn marco.krohn@web.de wrote:
On Monday 20 June 2005 21:57, Angela wrote:
The message below was sent to the Board today. Would implementing some sort of automatic copyvio checker be feasible?
I have done something similar for the German Wikipedia:
http://www.itp.uni-hannover.de/~krohn/wscan.html.utf8
it reads all new pages from the German Wikipedia, shows the beginning of the text and some statistics (and guesses which links to other articles might be interesting). It also takes parts of some sentences and checks whether they appear somewhere on the internet (by the way, 5 to 6 consecutive words are almost unique).
Finally the output is sorted by the number of hits ("Fundstellen"). I have several ideas how to improve the script further (e.g. whitelists), but right now I do not have the time to do this.
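The approach Marco describes (sample short runs of consecutive words, search for each as an exact phrase, and rank articles by hit count) could be sketched roughly as follows. This is only an illustration, not Marco's actual script: the function names are mine, and `search` stands in for whatever web-search API call (e.g. the Google API) the real tool would use.

```python
import re

def extract_shingles(text, n=6, step=25):
    """Pull runs of n consecutive words out of an article, sampling one
    run every `step` words. Runs of 5-6 words are almost unique on the
    web, so a verbatim hit strongly suggests copied text."""
    words = re.findall(r"\w+", text)
    return [" ".join(words[i:i + n])
            for i in range(0, max(len(words) - n + 1, 0), step)]

def rank_by_hits(articles, search):
    """Query each shingle as an exact phrase via the supplied `search`
    callable (hypothetical: returns a hit count for a query string) and
    sort articles by total hits, most suspicious first."""
    scored = []
    for title, text in articles.items():
        hits = sum(search('"%s"' % s) for s in extract_shingles(text))
        scored.append((hits, title))
    return sorted(scored, reverse=True)
```

Sorting by total hits is what puts likely copyright violations at the top of the report, as in the "Fundstellen" ordering above.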
Nevertheless, if someone is interested I am glad to send them the GPLed source code (Python), and I can surely give some advice.
best regards, Marco
P.S. google was so kind to extend my google key to 7000 requests per day (the standard google key only allows 1000 requests per day which is not sufficient)
I've written something similar, though very rough; I'm not a programmer.
It can usually find about 20 to 30 significant copyright violations in a day's worth of new pages on :en. It also gets a lot of false positives; I haven't finished parsing out all the templates.
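The template false positives arise because boilerplate like `{{stub}}` or infobox text appears verbatim on thousands of pages, so its phrases always get search hits. A minimal sketch of stripping `{{...}}` templates before checking (the function name is mine, and this ignores other wiki markup):

```python
import re

def strip_templates(wikitext):
    """Delete {{...}} template calls so template boilerplate does not
    trigger false positive search hits. The loop handles nesting by
    repeatedly removing innermost (brace-free) templates."""
    pattern = re.compile(r"\{\{[^{}]*\}\}")
    prev = None
    while prev != wikitext:
        prev = wikitext
        wikitext = pattern.sub("", wikitext)
    return wikitext
```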
CDVF could use a plugin along these lines; it would make a neat programming contest.