Detector for copyright violations


John Erling Blad

28 Aug 2008

3:01 a.m.

There have been several attempts to make bots that detect copyright violations. The problem is that many such "infringements" are legal, quotations for example, and then the writers get pissed off because they have used the material in a completely legal way.

I have made a Javascript-based solution that seems to solve the problem by placing a user in the loop. The only thing the script does is to mine the web for possible similar texts.

Basically, the script takes the added text, extracts the plain text, excludes some of it, breaks the rest into sentences, uses the sentences to build a query, rematches the results against the sentences, accumulates the matches, and gives a warning if a match limit is reached.
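The steps listed above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the actual gadget: every function name is hypothetical, the `search` callback stands in for whatever web search the real script uses, and the choice to exclude quoted passages and very short sentences is only a guess at what "excludes some of the text" means.

```javascript
// Hypothetical sketch of the described pipeline; not the author's code.

// Very rough markup stripping: HTML tags, [[wiki links]], bold/italic quotes.
function toPlainText(markup) {
  return markup
    .replace(/<[^>]*>/g, " ")
    .replace(/\[\[(?:[^\]|]*\|)?([^\]]*)\]\]/g, "$1") // [[target|label]] -> label
    .replace(/'{2,}/g, "")
    .replace(/\s+/g, " ")
    .trim();
}

// Assumption: text inside double quotes is a quotation and is skipped.
function excludeQuotations(text) {
  return text.replace(/"[^"]*"/g, " ");
}

// Split on sentence-ending punctuation and drop fragments under five words.
function splitSentences(text) {
  return (text.match(/[^.!?]+[.!?]*/g) || [])
    .map((s) => s.trim())
    .filter((s) => s.split(/\s+/).length >= 5);
}

// One exact-phrase term per sentence, combined with OR.
function buildQuery(sentences) {
  return sentences.map((s) => `"${s.replace(/[.!?]+$/, "")}"`).join(" OR ");
}

// Lowercase and collapse everything except letters and digits to single spaces.
function normalize(s) {
  return s.toLowerCase().replace(/[^\p{L}\p{N}]+/gu, " ").trim();
}

// `search(query)` is a placeholder returning [{ url, snippet }, ...].
// It is synchronous here for brevity; a real search would be asynchronous.
function checkAddedText(addedMarkup, search, matchLimit = 2) {
  const sentences = splitSentences(excludeQuotations(toPlainText(addedMarkup)));
  if (sentences.length === 0) return [];
  const results = search(buildQuery(sentences));
  const warnings = [];
  for (const { url, snippet } of results) {
    const haystack = normalize(snippet);
    // Rematch: which of our sentences actually occur in this result?
    const matched = sentences.filter((s) => haystack.includes(normalize(s)));
    if (matched.length >= matchLimit) warnings.push({ url, matched });
  }
  return warnings;
}
```

The returned warnings are only candidates for the person running the script to look at; nothing in the sketch decides that a match is an infringement.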

At the moment I am trying to extend the system to older edits, and also to make it a bit more resistant to small changes in the text. It is already fairly resistant to small reorganizations of the text.
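One common way to get this kind of tolerance, shown here purely as an illustration and not as what the gadget does, is to compare overlapping word n-grams ("shingles") instead of whole sentences. Reordering sentences keeps almost all shingles intact, and changing a single word only breaks the few shingles that contain it. The function names and the choice of three-word shingles are assumptions.

```javascript
// Hypothetical sketch: word-shingle overlap as a similarity measure.

// Set of all runs of n consecutive words (lowercased, punctuation ignored).
function shingles(text, n = 3) {
  const words = text.toLowerCase().match(/[\p{L}\p{N}]+/gu) || [];
  const set = new Set();
  for (let i = 0; i + n <= words.length; i++) {
    set.add(words.slice(i, i + n).join(" "));
  }
  return set;
}

// Fraction of the candidate's shingles that also occur in the source (0..1).
function containment(candidate, source, n = 3) {
  const a = shingles(candidate, n);
  const b = shingles(source, n);
  if (a.size === 0) return 0;
  let shared = 0;
  for (const s of a) {
    if (b.has(s)) shared++;
  }
  return shared / a.size;
}
```

A score near 1 means the added text is almost entirely contained in the found page, a score near 0 means they share little. Where to put the warning threshold is a tuning question the sketch does not answer.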

John


Christophe Henner

28 Aug 2008

3:17 a.m.

Wikiquality-l mailing list Wikiquality-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikiquality-l

Have you considered that the bots could make a list of possible copyright infringements, which users could then check?

My point is that it could run 24/7; it would just need a server to run on, and the list would always be up to date.

Andrew Whitworth

3:27 a.m.

Similar to the way I've seen COIBot run on meta. Compile a list and write it to some place like a user space. People could check the userspace pages for entries and follow up on them.
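As a rough illustration of the "compile a list and write it to a user space page" idea, the sketch below turns a list of candidate matches into wikitext. It says nothing about how COIBot works; the function name, the fields on each candidate, and the page heading are all made up for the example, and actually saving the text to a wiki page is left out.

```javascript
// Hypothetical sketch: format candidate matches as a wikitext list.
// Each candidate is assumed to look like { page, url, matches }.
function formatReport(candidates) {
  // Copy before sorting so the caller's array is left untouched.
  const sorted = [...candidates].sort((a, b) => b.matches - a.matches);
  const lines = ["== Possible copyright matches =="];
  for (const c of sorted) {
    lines.push(
      `* [[${c.page}]]: ${c.matches} matching sentence(s) at ${c.url} (needs human review)`
    );
  }
  return lines.join("\n");
}
```

Entries with the most matching sentences come first, so people working through the page see the strongest candidates at the top.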

--Andrew Whitworth

John Erling Blad

4:10 a.m.

The warning is a temporary message, shown only to those who run the script. Because of this it is not a legal notice of a possible copyright infringement; it is a tool. The distinction is very important, as the first must result in a solution while the second may trigger an action.

Also, if you check the history of some of those bots you will find that they create a lot of trouble because they post statements about copyright infringements, when either the material should be deleted on sight or the contributor should be contacted. With an interactive tool, it is the admin using it who takes action, not the tool itself.

And one last thing: if you try to run such a bot you will very quickly find that there are a lot of nearly identical texts. It's like finding two eggs at Walmart and yelling "hey, it's a copyvio!"

John


mike.lifeguard

4:37 a.m.

Is this script available on-wiki somewhere?

Mike


John Erling Blad

5:09 a.m.

It's a gadget on no.wp, but it is in an alpha state and some debugging is still going on.

John

