On Monday 20 June 2005 23:11, Mark Williamson wrote:
> ...and it would also flag every single page in Wikipedia, because they can also be found in absoluteastronomy, etc.
It is possible to do the Google search with "-wikipedia", which removes most of the mirrors from the results. The script could also filter mirrors automatically, but nevertheless you are right that it is far easier to consider new pages only.
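For example, the query for one phrase could simply be built like this (only a rough Python sketch, the function name and the example phrase are arbitrary):

    def build_query(phrase):
        # Exact phrase in quotes, plus "-wikipedia" to drop pages that
        # mention Wikipedia (most mirrors credit us somewhere on the page).
        return '"%s" -wikipedia' % phrase

    print(build_query("the quick brown fox jumps over"))
    # -> "the quick brown fox jumps over" -wikipedia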
Concerning the number of words: I found that in most cases a run of 5-6 words is unique (of course there are exceptions). But if one website contains the same combination of 5-6 words three times, you can be sure that this is not by chance. A more detailed analysis is still needed, of course, e.g. to handle public domain resources such as the Brockhaus 1911.
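Roughly, the check I have in mind would look like this (again only a sketch in Python; search() is just a placeholder for whatever wraps the search engine, and the phrase length and threshold are simply the values from above):

    import re

    PHRASE_LEN = 6   # words per phrase
    MIN_HITS = 3     # a site matching this many phrases is suspicious

    def phrases(text, length=PHRASE_LEN):
        # Split the article text into consecutive, non-overlapping runs
        # of `length` words; each run becomes one search phrase.
        words = re.findall(r"\w+", text)
        return [" ".join(words[i:i + length])
                for i in range(0, len(words) - length + 1, length)]

    def suspicious_sites(text, search):
        # `search` stands for the search engine wrapper: it takes a query
        # string and returns the domains of the matching pages.
        hits = {}
        for phrase in phrases(text):
            for domain in search('"%s" -wikipedia' % phrase):
                hits[domain] = hits.get(domain, 0) + 1
        # Only sites that match several independent phrases are reported,
        # which keeps single coincidental matches out of the list.
        return [d for d, n in hits.items() if n >= MIN_HITS]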
A completely automatic detection of copyright violations without too many false positives is a difficult problem. On the other hand, it might be sufficient to improve the tools for the editors.
best regards, Marco