The message below was sent to the Board today. Would implementing some sort of automatic copyvio checker be feasible?
The second part of the email suggests it is too difficult to contact us about copyright violations. With the addition of the "contact us" link in the sidebar, I thought this would stop being a problem. Is there any other way of making it easier?
Angela.
---- Forwarded message ----
In regard to the continuing copyright problems caused by members who do not respect copyrights, I would recommend implementing something like what http://copyscape.com uses. From what I can tell, they use a Google API to search for text found on one page and see which other pages contain the same text. Using a similar methodology, you could flag new pages that are substantially similar to existing pages on the Internet for further review. While this wouldn't catch all copyright violations, it would go a long way towards making it easier to weed out blatant violations like the one I reported.
The fact that some individuals have absolutely no respect for copyrights and plagiarism is a serious problem that Wikipedia needs to address. Some people seem to think that Wikipedia is their personal means of bringing down copyright laws and "freeing" content. This is a shame, because these individuals threaten Wikipedia's long-term prospects.
On a related note, it should be easier to report copyright violations on the Wikipedia website. The current setup makes it tremendously burdensome to figure out how to report a violation. There needs to be a simple link from every page to a contact form that lets someone report a violation without any knowledge of how Wikipedia works. Doing this would put members on notice that Wikipedia isn't a rogue operation where anything goes, and that it takes copyright issues seriously.
On Monday 20 June 2005 21:57, Angela wrote:
The message below was sent to the Board today. Would implementing some sort of automatic copyvio checker be feasible?
I have done something similar for the German Wikipedia:
http://www.itp.uni-hannover.de/~krohn/wscan.html.utf8
It reads all new pages from the German Wikipedia, shows the beginning of the text and some statistics (and guesses which links to other articles might be interesting). It also takes parts of some sentences and checks whether they appear somewhere on the internet (by the way, 5 to 6 consecutive words are almost always unique).
Finally, the output is sorted by the number of hits ("Fundstellen"). I have several ideas for improving the script further (e.g. whitelists), but right now I do not have the time.
Nevertheless, if anyone is interested, I am glad to send them the GPLed source code (Python), and I can surely give some advice.
best regards, Marco
P.S. Google was kind enough to extend my Google key to 7,000 requests per day (the standard key only allows 1,000 requests per day, which is not sufficient).
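A rough sketch of the pipeline described above (a hedged reconstruction, not Marco's actual GPLed script; `hit_count` is a hypothetical stand-in for whatever search API is available):

```python
import re

def candidate_phrases(text, length=6, samples=4):
    """Take one run of `length` consecutive words from each sentence,
    up to `samples` phrases per article. Runs of 5-6 words are almost
    always unique on the web, so exact-phrase hits are meaningful."""
    phrases = []
    for sentence in re.split(r'[.!?]\s+', text):
        words = sentence.split()
        if len(words) >= length:
            phrases.append(' '.join(words[:length]))
        if len(phrases) == samples:
            break
    return phrases

def rank_by_hits(phrases, hit_count):
    """Sort checked phrases by their number of web hits ("Fundstellen"),
    highest first. `hit_count` is a caller-supplied function wrapping
    the search backend."""
    return sorted(phrases, key=hit_count, reverse=True)
```

In 2005 the natural backend for `hit_count` was the Google API key mentioned in the thread; any exact-phrase search that returns a result count would do.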
On Monday 20 June 2005 23:03, Marco Krohn wrote:
I have done something similar for the German Wikipedia:
Here is an example showing the script in action:
http://www.itp.uni-hannover.de/~krohn/copyvio.png
shows that four sentence fragments have been checked ("Geprüfte Satzteile"). The last line ("Fundstellen", i.e. hits) tells us that (3) and (4) have been found on the website
http://www.classical-composers.org/cgi-bin/ccd.cgi?comp=pierne_paul
If you compare this webpage with the new de.wikipedia article http://de.wikipedia.org/wiki/Paul_Piern%C3%A9
you will see that both pages are indeed very similar:
WP: "Er war Organist an St-Paul-St-Louis in Paris" [He was organist at St-Paul-St-Louis in Paris]
c-c.org: "Er war Organist an St-Paul-St-Louis in Paris." [He was organist at St-Paul-St-Louis in Paris.]
WP: "komponierte er zwei Sinfonien, eine sinfonische Dichtung, ein Konzert für Oboe, Cello und Orchester, kammermusikalische Werke, Klavier- und Orgelstücke sowie ein Messe, ein Oratorium und Chorwerke." [he composed two symphonies, a symphonic poem, a concerto for oboe, cello and orchestra, chamber works, piano and organ pieces, as well as a mass, an oratorio and choral works.]
c-c.org: "schrieb 2 Sinfonien, ein sinfonisches Gedicht, ein Konzert für Oboe, Cello und Orchester, eine Messe, ein Oratorium, Chorwerke, Kammermusik, Klavier- und Orgelstücke, mehrere Ballette und 2 Opern." [wrote 2 symphonies, a symphonic poem, a concerto for oboe, cello and orchestra, a mass, an oratorio, choral works, chamber music, piano and organ pieces, several ballets and 2 operas.]
"Our" editor modified some words and changed the order of others, but the similarity is nevertheless high enough to be found by a script. I do not want to discuss whether this is a copyvio, but it is clear that the WP user was probably "inspired" by the c-c.org web page ;-)
best regards, Marco
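The kind of overlap in Marco's example - same content words, a few swapped or reordered - can still be scored without exact phrase matching. A minimal sketch (my illustration, not part of Marco's script) using word-level Jaccard similarity:

```python
def word_jaccard(a, b):
    """Similarity of two texts as shared distinct words divided by
    total distinct words. Reordering words or swapping a few of them
    (as the WP editor did) barely lowers the score, unlike an
    exact-phrase search."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb)
```

A pair of paraphrased sentences like the ones above scores far higher than two unrelated sentences, which is enough to queue a page for human review.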
What worries me here is copyright paranoia.
Isn't it likely that people will see the flag, and think "copyright! AHHHHHHH!!! NOOOOOOOOOO!!! KILL KILL KILL!!!!!!!!!!!!!!"
Mark
On 20/06/05, Marco Krohn marco.krohn@web.de wrote:
On 6/20/05, Marco Krohn marco.krohn@web.de wrote:
I've written something similar; it's very rough, as I'm not a programmer.
It can usually find about 20 to 30 significant copyright violations in a day's worth of previous new pages on :en. It also gets a lot of false positives; I haven't finished parsing out all the templates.
CDVF could use a plugin along these lines; it would make a neat programming contest.
On 6/21/05, Stirling Newberry stirling.newberry@xigenics.net wrote:
LA Times' Wikitorials will perhaps go down in history as the [[Disco Demolition Night]] of the Internet.
-User:Fuzheado
PS: That would be http://en.wikipedia.org/wiki/Disco_Demolition_Night for the bracket challenged :)
Andrew Lih (andrew.lih@gmail.com) [050621 17:00]:
On 6/21/05, Stirling Newberry stirling.newberry@xigenics.net wrote:
LA Times' Wikitorials will perhaps go down in history as the [[Disco Demolition Night]] of the Internet.
How annoying. I thought it was a marvellous experiment, and was surprised how non-crap the results were (e.g. better than Indymedia, though that's not hard). I don't think it was that bad ;-)
- d.
On 6/21/05, David Gerard fun@thingy.apana.org.au wrote:
The more the LA Times talks about it, the more I'm convinced they don't "get it". To wit:
"As long as we can hit a high standard and have no risk of vandalism, then it is worth having a try at it again," said Rob Barrett, general manager of Los Angeles Times Interactive.
http://www.latimes.com/news/nationworld/nation/la-na-wiki21jun21,0,1952611.s...
"No risk of vandalism" eh? :)
-User:Fuzheado
On Jun 21, 2005, at 8:26 AM, David Gerard wrote:
It's a first pass. The reality of the internet is that there is something of value - namely access to channel and link equity - which isn't valued in terms of money, but which is scarce. The problem is how to negotiate both the width of and the access to the channel without the most usual way of managing scarcity - id est, an abstract exchange market. The LA Times failed to do this, and got the expected results.
I'm not sure your analysis is spot-on. Mostly we are just looking at culture shock and impatience. I'm pretty sure those hundreds of people who did contribute could, after they learn the software, revert the dirty pictures and move on in much the same way we do. They were depending on a newspaper employee to monitor the editorial, not on the contributors. But he went to bed.
Fred
On Jun 20, 2005, at 10:21 PM, Stirling Newberry wrote:
http://www.bopnews.com/archives/003730.html#3730
Wikipedia-l mailing list Wikipedia-l@Wikimedia.org http://mail.wikipedia.org/mailman/listinfo/wikipedia-l
On Jun 21, 2005, at 8:39 AM, Fred Bauder wrote:
"They were depending". Exactly - the channel equity belonged to someone else, and therefore it was not that urgent for them to defend it.
...and it would also flag every single page in Wikipedia, because they can also be found in absoluteastronomy, etc.
If you're talking about new pages only, it might be OK, but it depends on how long the search strings are -- is it looking for 7 identical words in a row, 10 words, or 50?
Mark
On 20/06/05, Angela beesley@gmail.com wrote:
On Monday 20 June 2005 23:11, Mark Williamson wrote:
It is possible to do the Google search with "-wikipedia", which removes most of the mirrors from the results. The script could also filter mirrors automatically, but you are nevertheless right that it is far easier to consider new pages only.
Concerning the number of words: I found that in most cases 5-6 words in a row are unique (of course there are exceptions). But if one website contains the same combination of 5-6 words three times, you can be sure that this is not by chance. Of course a more detailed analysis is still needed; e.g. there are public domain resources such as the Brockhaus 1911, etc.
A completely automatic analysis of copyright violations without too many false positives is a difficult problem. On the other hand, it might be sufficient to improve the tools for the editors.
best regards, Marco
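Marco's heuristic - one shared 5-6 word run might be chance, but three almost never are - can be sketched as a shingle intersection (an illustration under the thread's assumptions, not his actual code):

```python
def shingles(text, n=6):
    """Every run of n consecutive words in the text, lowercased."""
    words = text.lower().split()
    return {' '.join(words[i:i + n]) for i in range(len(words) - n + 1)}

def looks_copied(article, source, n=6, threshold=3):
    """One shared n-word run can happen by chance (or be a stock
    phrase); three or more shared runs almost never are, so flag
    the pair for human review."""
    return len(shingles(article, n) & shingles(source, n)) >= threshold
```

The mirror problem remains: a whitelist of known Wikipedia mirrors (or a "-wikipedia" search term, as suggested above) still has to be applied before counting matches.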
Mark Williamson wrote:
I've always held that anything over 5-6 words is plagiarism, unless it is quoted. Quotations are fair use provided they are cited appropriately.
Then again, if someone changes every seventh word in a sentence of 25 words, that's still 14-18 words ripped straight from another source.
-- Alphax OpenPGP key: 0xF874C613 - http://tinyurl.com/cc9up http://en.wikipedia.org/wiki/User:Alphax There are two kinds of people: those who say to God, 'Thy will be done,' and those to whom God says, 'All right, then, have it your way.' - C. S. Lewis
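Alphax's point - that changing every seventh word still leaves long untouched runs - is exactly what a longest-common-run check catches. A sketch (my illustration, not any tool from the thread) using the classic longest-common-substring dynamic program, applied to words instead of characters:

```python
def longest_common_run(a, b):
    """Length of the longest run of consecutive identical words
    shared by two texts."""
    wa, wb = a.lower().split(), b.lower().split()
    # table[i+1][j+1] = length of the common run ending at wa[i], wb[j]
    table = [[0] * (len(wb) + 1) for _ in range(len(wa) + 1)]
    best = 0
    for i, word_a in enumerate(wa):
        for j, word_b in enumerate(wb):
            if word_a == word_b:
                table[i + 1][j + 1] = table[i][j] + 1
                best = max(best, table[i + 1][j + 1])
    return best
```

Changing every seventh word of a 25-word sentence still leaves runs of six identical words, comfortably above the 5-6 word threshold discussed earlier in the thread.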
Alphax wrote:
I've always held that anything over 5-6 words is plagiarism, unless it is quoted. Quotations are fair use provided they are cited appropriately.
Your unattributed quote "quotations are fair use" is plagiarism, and that is unacceptable. You should give Joseph Carter credit where it is due. He wrote "attributed quotations are fair use" in a post to debian-legal: http://lists.debian.org/debian-legal/2001/05/msg00001.html , and your post was clearly ripped off from that.
Seriously though, I have seen a case where a Wikipedian slapped a copyvio tag on something because it shared some phrases with a webpage. The author complained that he had spent hours reading multiple sources, and rewriting the information therein in his own words. That is unequivocally acceptable under copyright law, and the tag was soon removed. There's no need to be paranoid. We should be careful not to accuse people of plagiarism who are merely paraphrasing or rewriting.
-- Tim Starling
Tim Starling wrote:
Well then, better let Joseph Carter know that Gilbert B. Rodman wrote, in 1998 - http://www.iaspm.net/rpm/CopyRi_1.html - "reasonable" quotations are "fair use."
In myself, much astonishment can be found. Tim, sources copyrighted have been, by you, off-ripped.
"was clearly ripped off from" from http://www.windowsitpro.com/Articles/Index.cfm?ArticleID=19265&DisplayTa..., has, by you, been clearly plagiarised.
Unacceptability can be multitudinous in the plagiarism-like acts.
Easily can be spotted plagiarism, since obviously longer sequence than 4 words from some oddly-existing source must have been themselves plagiarised.
On you shame I cast.
Mark
On 20/06/05, Tim Starling t.starling@physics.unimelb.edu.au wrote:
Mark Williamson wrote:
Lucasfilm. Will. Sue. You. For. Patent. Infringement. Of. Yoda.
(And Shatner and Paramount are chasing me with rusty pitchforks! Argh!)
On Mon, 20 Jun 2005 21:57:37 +0200, Angela wrote:
The second part of the email suggests it is too difficult to contact us about copyright violations. With the addition of the "contact us" link in the sidebar, I thought this would stop being a problem. Is there any other way of making it easier?
How about clarifying policies so editors actually know what they are supposed to do? I've harped on this on quite a few occasions: all clearly spelled-out policy deals with the case where the most recent edit is a copyvio. What if it is only noticed several or many edits later?
I have looked for WP policy on this very issue, and like others who did the same, I could not find anything conclusive.
In my opinion, we must revert to the last version that was not a copyvio and then salvage whatever we can from later edits. However, in my experience, mine is a minority position, which is why I gave up on tracking down copyvios -- what's the point if people reinstate the copyvio version and then change a couple of words to disguise what they did?
Roger