[Wikipedia-l] Automatically checking for copyright violations
Marco Krohn
marco.krohn at web.de
Mon Jun 20 21:32:53 UTC 2005
On Monday 20 June 2005 23:11, Mark Williamson wrote:
> ...and it would also flag every single page in Wikipedia, because they
> can also be found in absoluteastronomy, etc.
It is possible to do the Google search with "-wikipedia", which removes most of
the mirrors from the results. The script could also filter mirrors
automatically, but nevertheless you are right that it is far easier to
consider new pages only.
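As a rough sketch of what such a query could look like (the helper name and
the exclusion list are illustrative, not from any existing tool):

```python
# Build a Google query for a distinctive phrase, excluding results
# that mention Wikipedia (the "-wikipedia" trick from above).
def build_query(phrase, excluded=("wikipedia",)):
    """Quote the phrase for an exact match and append term exclusions."""
    return '"%s" ' % phrase + " ".join("-%s" % term for term in excluded)

print(build_query("some distinctive phrase"))
# '"some distinctive phrase" -wikipedia'
```

One could also use Google's "-site:" operator to drop whole mirror domains
instead of excluding the word.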
Concerning the number of words: I found that in most cases a run of 5-6 words
is unique (of course there are exceptions). But if one website contains the
same combination of 5-6 words three times, you can be sure that this is not by
chance. Of course a more detailed analysis is still needed; e.g. there are
public domain resources such as the Brockhaus 1911, etc.
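The three-matching-runs heuristic above could be sketched like this (function
names and the threshold parameter are my own illustration, not part of any
actual script):

```python
import re

def shingles(text, n=6):
    """Return the set of n-word runs ("shingles") in the text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def looks_copied(article, candidate, n=6, threshold=3):
    """True if the candidate page shares at least `threshold` runs of
    n words with the article -- unlikely to happen by chance."""
    return len(shingles(article, n) & shingles(candidate, n)) >= threshold
```

In practice the candidate text would come from fetching the Google hits, and
public-domain sources would need to be whitelisted before flagging anything.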
Fully automatic detection of copyright violations without too many false
positives is a difficult problem. On the other hand, it might be sufficient to
improve the tools available to editors.
best regards,
Marco