There have been several attempts to make bots that detect copyright violations. The problem is that many such "infringements" are legal, quotations for example, and then the writers get pissed because they have used the material in a completely legal way.
I have made a JavaScript-based solution that seems to solve the problem by placing a user in the loop. The only thing the script does is mine the web for possibly similar texts.
Basically, the script takes the added text, extracts the plain text, excludes some of it, breaks it into sentences, uses the sentences to build a query, rematches the results against the sentences, accumulates the matches, and gives a warning if a match limit is reached.
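The steps above can be sketched roughly as follows. This is only an illustration, not the gadget's actual code: every function name, the markup-stripping rules, the five-word minimum used to exclude short sentences, the quoted-phrase `OR` query syntax, and the match limit are assumptions. The web search itself is left out; the sketch takes the returned result snippets as plain strings.

```javascript
// Hypothetical sketch; none of these names come from the real gadget.

// Reduce wikitext to rough plain text (deliberately simplistic).
function toPlainText(wikitext) {
  return wikitext
    .replace(/\{\{[^{}]*\}\}/g, " ")                   // drop simple templates
    .replace(/\[\[(?:[^\]|]*\|)?([^\]]*)\]\]/g, "$1")  // keep link labels
    .replace(/'{2,}/g, "")                             // drop bold/italic marks
    .replace(/\s+/g, " ")
    .trim();
}

// Break text into sentences and exclude very short ones (assumed rule).
function splitSentences(text, minWords = 5) {
  return (text.match(/[^.!?]+[.!?]*/g) || [])
    .map((s) => s.trim())
    .filter((s) => s.split(/\s+/).length >= minWords);
}

// Build a search query from the sentences (assumed query syntax).
function buildQuery(sentences) {
  return sentences.map((s) => '"' + s + '"').join(" OR ");
}

// Lowercase, replace punctuation with spaces, collapse whitespace.
function normalize(text) {
  return text.toLowerCase().replace(/[^\p{L}\p{N}]+/gu, " ").trim();
}

// Rematch result snippets against the sentences and accumulate the hits.
function countMatches(sentences, snippets) {
  const haystacks = snippets.map((s) => " " + normalize(s) + " ");
  return sentences.filter((sentence) => {
    const needle = " " + normalize(sentence) + " ";
    return haystacks.some((h) => h.includes(needle));
  }).length;
}

// Run the whole chain; `snippets` stands in for the search results.
function checkAddedText(wikitext, snippets, matchLimit = 2) {
  const sentences = splitSentences(toPlainText(wikitext));
  const matched = countMatches(sentences, snippets);
  return {
    query: buildQuery(sentences),
    matched: matched,
    warn: matched >= matchLimit, // only a hint for the person running it
  };
}
```

The result is only a hint: `warn` tells the person running the script to look closer, and the decision about whether something is a quotation or a copy stays with them.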
At the moment I am trying to extend the system to older edits, and also to make it a bit more resistant to small changes in the text. It is already fairly resistant to small reorganizations of the text.
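One common way to tolerate small wording changes, which may or may not be what the gadget ends up using, is to compare overlapping word n-grams ("shingles") instead of whole sentences. The sketch below is an assumption throughout: the function names, the trigram size, and the containment measure are illustrative choices.

```javascript
// Hypothetical sketch of shingle-based matching; not the gadget's code.

// Split text into a set of overlapping n-word sequences.
function shingles(text, n = 3) {
  const words = text
    .toLowerCase()
    .split(/[^\p{L}\p{N}]+/u)
    .filter(Boolean);
  const out = new Set();
  for (let i = 0; i + n <= words.length; i++) {
    out.add(words.slice(i, i + n).join(" "));
  }
  return out;
}

// Fraction of the candidate's shingles that also occur in the source.
// 1 means every shingle was found, 0 means none were.
function containment(candidate, source, n = 3) {
  const a = shingles(candidate, n);
  if (a.size === 0) return 0;
  const b = shingles(source, n);
  let shared = 0;
  for (const s of a) {
    if (b.has(s)) shared++;
  }
  return shared / a.size;
}
```

With a score like this, a sentence where one word has been swapped still shares most of its shingles with the source, so it scores well above zero, while an exact sentence match would fail outright.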
John
2008/8/27 John Erling Blad john.erling.blad@jeb.no:
Wikiquality-l mailing list Wikiquality-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikiquality-l
Have you considered that the bots could make a list of possible copyright infringements, which users could then check?
My point is that it could run 24/7; it would just need a server to run on, and the list would stay up to date.
On Wed, Aug 27, 2008 at 3:17 PM, Christophe Henner christophe.henner@gmail.com wrote:
Have you considered that the bots could make a list of possible copyright infringements, which users could then check?
My point is that it could run 24/7; it would just need a server to run on, and the list would stay up to date.
Similar to the way I've seen COIBot run on Meta: compile a list and write it somewhere like a page in user space. People could check those user-space pages for entries and follow up on them.
--Andrew Whitworth
The warning is a temporary message, shown only to those who run the script. Because of this it is not a legal notice of a possible copyright infringement; it is a tool. The distinction is very important, as the first must result in a solution while the second may trigger an action.
Also, if you check the history of some of those bots you will find that they create a lot of trouble because they post statements about copyright infringements, when either the material should be deleted on sight or the contributor should be contacted. With an interactive tool, it is the admin using it who takes action, not the tool itself.
And one last thing: if you try to run such a bot you will very quickly find that there are a lot of nearly identical texts. It's like finding two eggs at Walmart and yelling "hey, it's a copyvio"!
John
Is this script available on-wiki somewhere?
Mike
-----Original Message----- From: wikiquality-l-bounces@lists.wikimedia.org [mailto:wikiquality-l-bounces@lists.wikimedia.org] On Behalf Of John Erling Blad Sent: August 27, 2008 4:01 PM To: Wikimedia Quality Discussions Subject: [Wikiquality-l] Detector for copyright violations
It's a gadget on no.wp, but it is in an alpha state and some debugging is still going on. John