On 6/20/05, Marco Krohn marco.krohn@web.de wrote:
On Monday 20 June 2005 21:57, Angela wrote:
The message below was sent to the Board today. Would implementing some sort of automatic copyvio checker be feasible?
I have done something similar for the German Wikipedia:
http://www.itp.uni-hannover.de/~krohn/wscan.html.utf8
it reads all new pages from the German Wikipedia, shows the beginning of the text and some statistics (and guesses which links to other articles might be interesting). It also takes parts of some sentences and checks whether they appear somewhere on the internet (by the way, 5 to 6 consecutive words are almost unique).
Finally the output is sorted by the number of hits ("Fundstellen"). I have several ideas how to improve the script further (e.g. whitelists), but right now I do not have the time to do this.
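The approach Marco describes (sample short runs of consecutive words, search for each as an exact phrase, and rank articles by hit count) could be sketched roughly as follows. This is only an illustration, not Marco's actual script: the function names are mine, and `search` stands in for whatever web-search API call (e.g. the Google API) the real tool would use.

```python
import re

def extract_shingles(text, n=6, step=25):
    """Pull runs of n consecutive words out of an article, sampling one
    run every `step` words. Runs of 5-6 words are almost unique on the
    web, so a verbatim hit strongly suggests copied text."""
    words = re.findall(r"\w+", text)
    return [" ".join(words[i:i + n])
            for i in range(0, max(len(words) - n + 1, 0), step)]

def rank_by_hits(articles, search):
    """Query each shingle as an exact phrase via the supplied `search`
    callable (hypothetical: returns a hit count for a query string) and
    sort articles by total hits, most suspicious first."""
    scored = []
    for title, text in articles.items():
        hits = sum(search('"%s"' % s) for s in extract_shingles(text))
        scored.append((hits, title))
    return sorted(scored, reverse=True)
```

Sorting by total hits is what puts likely copyright violations at the top of the report, as in the "Fundstellen" ordering above.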
Nevertheless, if someone is interested I am glad to send them the GPLed source code (Python), and I can surely give some advice.
best regards, Marco
P.S. google was so kind to extend my google key to 7000 requests per day (the standard google key only allows 1000 requests per day which is not sufficient)
I've written something similar, though very rough; I'm not a programmer.
It can usually find about 20 to 30 significant copyright violations in a day's worth of new pages on :en. It also gets a lot of false positives; I haven't finished parsing out all the templates.
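The template false positives arise because boilerplate like `{{stub}}` or infobox text appears verbatim on thousands of pages, so its phrases always get search hits. A minimal sketch of stripping `{{...}}` templates before checking (the function name is mine, and this ignores other wiki markup):

```python
import re

def strip_templates(wikitext):
    """Delete {{...}} template calls so template boilerplate does not
    trigger false positive search hits. The loop handles nesting by
    repeatedly removing innermost (brace-free) templates."""
    pattern = re.compile(r"\{\{[^{}]*\}\}")
    prev = None
    while prev != wikitext:
        prev = wikitext
        wikitext = pattern.sub("", wikitext)
    return wikitext
```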
CDVF could use a plugin along these lines; it would make a neat programming contest.