I've been seeing a rise in in-article copyvios. Last night I got one in [[Content management system]]. I know that only some paragraphs have these copyvios, and not entire articles, so complete rewrites aren't necessary. Thus, I'm attempting to write a script that (a) opens tabs with "Special:Random" on them, (b) selects the first sentence from each paragraph (line break), (c) Googles the sentence, (d) if there are any exact matches not from en.wikipedia.org, puts up a little message for me to check and remove the copyvio, and (e) repeats.
Problem is, all I know is AppleScript. If any of you Perl, pywikipedia, or AWB types have another way of writing this, can someone write it so the general community can use it to remove copyvios? (Or is this possible with AWB?)
Chris (Ccool2ax)
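In outline, the loop Chris describes might be sketched in Perl along these lines - a rough, untested illustration rather than anything that exists, which leaves the actual Google lookup as a manual copy-and-paste step and only automates fetching a random article and pulling out candidate sentences:

#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;
use URI::Escape;

my $ua = LWP::UserAgent->new( agent => 'copyvio-check-sketch/0.1' );

# Special:Random issues a redirect to a random article; LWP follows it.
my $res = $ua->get('http://en.wikipedia.org/wiki/Special:Random');
die 'Fetch failed: ' . $res->status_line . "\n" unless $res->is_success;

# Work out which article we landed on, then grab its raw wikitext.
my ($title) = $res->request->uri =~ m{/wiki/(.+)$};
die "Could not determine article title\n" unless defined $title;
my $raw = $ua->get("http://en.wikipedia.org/w/index.php?title=$title&action=raw");
die 'Raw fetch failed: ' . $raw->status_line . "\n" unless $raw->is_success;

print 'Article: ', uri_unescape($title), "\n";

my @paragraphs = split /\n\s*\n/, $raw->content;
for my $para (@paragraphs) {
    $para =~ s/\[\[(?:[^|\]]*\|)?([^\]]+)\]\]/$1/g;    # [[target|text]] -> text
    $para =~ s/'{2,}//g;                               # strip bold/italic markup
    $para =~ s/<[^>]+>//g;                             # strip stray HTML tags
    next if $para =~ /^\s*[\{\|=*#:]/;                 # skip templates, tables, headers, lists
    my ($sentence) = $para =~ /^\s*(.{40,}?[.?!])\s/;  # first reasonably long sentence
    next unless defined $sentence;
    print qq{Google this, in quotes: "$sentence"\n};
}

Each printed phrase can then be pasted into Google by hand, which keeps a human in the loop as step (d) asks.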
When featuring the script, don't limit it to just excluding en.wikipedia.org; in fact, filter out any mention of "wikipedia". And yes, it's good that you're setting it so that it'll alert you instead of immediately removing the text, since Google can return false positives.
There are a lot of mirrors out there that do not credit us. DB had a lot of problems with his bot.
Well, I'll use the list of mirrors in my script too, and since I manually check each edit, I can pretty easily tell when it's an other-way-around copyvio (e.g. if wikilinks or section headers are preserved).
Chris
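To illustrate the heuristic Chris mentions (a sketch of the idea only, not his actual check), leftover wiki markup on the external page is a strong hint that the copying went outward from Wikipedia rather than inward:

sub looks_like_mirror {
    # Does a fetched external page still carry Wikipedia markup or link targets?
    my ($html) = @_;
    return 1 if $html =~ /\[\[[^\]]+\]\]/;                # raw wikilinks survived
    return 1 if $html =~ /^==[^=].*==\s*$/m;              # raw section headers survived
    return 1 if $html =~ /action=edit|\/wiki\/Special:/i; # copied MediaWiki link targets
    return 0;
}

# e.g. in the result loop: next RESULT if looks_like_mirror($page_html);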
Yes, please work on an anti-copyvio bot. I've seen several bad things on the increase (you don't want to see newpages right now). Any tool that automates part of the process is useful. Like someone said earlier: it's good if you make it semi-automatic, to filter out false positives. Ask other bot-builders for help if you need it.
Mgm
Coding the semi-auto bot in AppleScript is hard, because AppleScript ain't too powerful and I'm not quite good enough. I'm going to keep trying to code it for a few days, but just so I don't waste effort, is AWB capable of this sorta thing? I mean, selecting random sentences and putting them on the clipboard for Googling.
If I can't finish this, would any of you Wikipedians be so kind as to write a script for me to run? Thanks, Chris
Sure. Try this: http://downlode.org/Code/Perl/Wikipedia/copyviofinder.txt
It's a little rough and ready, but seems to work. A couple of random examples:
[[USS Dale (DD-290)]]
Search phrase: Dale operated with Destroyer Squadrons, Scouting Fleet, on the Atlantic
Found:
http://www.historycentral.com/navy/destroyer/DaleIIIdd290.html
http://www.globalsecurity.org/military/agency/navy/dd-290.htm
http://www.hazegray.org/danfs/destroy/dd290txt.htm

[[Cincinnati Bengals]]
Search phrase: The Cincinnati Bengals are a professional American football team based
Found:
http://www.jazzsportsnews.com/nfl/teams/64.html
http://www.goupstate.com/apps/pbcs.dll/section?category=NEWS&template=wi... ati_Bengals
http://www.acheapseat.com/cincinnati_bengals_tickets.html
http://www.blinkbits.com/bits/viewforum/cincinnati_bengals_bio?f=2552
http://www.frontrowking.com/football/cincinnati_bengals_tickets.html
http://www.ticketstogo.com/sport/nfl/Cincinnati-Bengals.html
http://www.free-picks.org/nfl/articles/bengals.html
http://www.foxboroticketking.com/vscincinnatibengals.html
[lots more URLs snipped]
Whether the copyvio is an inward or outward bound one in each case is sadly beyond the scope of my programming skills, so I leave that to you.
The fact that people continually insert copyvios into articles is very distressing.
Absolutely true, but what is also distressing is the number of people (*cough* Trekphiler *cough*) who cannot get their minds around the concept that material explicitly released into the public domain (*cough* DANFS *cough*) is not copyright-reserved and its credited use is not plagiarism.
Not to mention there are other sources with licenses compatible with that of the GFDL. I hope the copyvio bot won't be fooled by Answers.com. :)
As it stands I've coded in an ignore list to skip certain domains - generally speaking, WP mirrors - one of which is answers.com.
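A minimal version of such an ignore list might look like the following - a sketch of the general idea, not the actual code in copyviofinder.txt:

use URI;

# Domains whose hits should never count as evidence of a copyvio.
my %ignore = map { $_ => 1 } qw( en.wikipedia.org wikipedia.org answers.com );

sub is_ignored {
    my ($url) = @_;
    my $host = eval { URI->new($url)->host } or return 0;
    $host =~ s/^www\.//;
    return 1 if $ignore{$host};
    return 1 if $host =~ /wikipedia/i;   # any other subdomain or lookalike naming us
    return 0;
}

# my @suspect_urls = grep { !is_ignored($_) } @search_results;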
There is somewhere a pretty complete list of WP mirrors, official and otherwise. It was intended as a data file for a Greasemonkey script, to filter these results from Google searches. You could probably adapt it, using my incredibly helpful and explicit directions to find it.
Steve
Aha, it looks like you're talking about this: http://meta.wikimedia.org/wiki/Mirror_filter
Thanks very much, I'll work it in.
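One guess at how that page could be worked in, assuming it boils down to a list of hostnames (the pattern below is an assumption about the page's format, not a description of it):

use WWW::Mechanize;

# Sketch: harvest likely mirror hostnames from the meta page into the ignore list.
my $mech = WWW::Mechanize->new();
$mech->get('http://meta.wikimedia.org/wiki/Mirror_filter');

my $page = $mech->content;
my %ignore;
while ( $page =~ /\b((?:[a-z0-9-]+\.)+(?:com|org|net|info|biz))\b/gi ) {
    $ignore{ lc $1 } = 1;
}
print scalar(keys %ignore), " candidate mirror domains collected\n";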
The other page, useful for when you find non-compliant mirrors, is [[Wikipedia:Mirrors and forks]] (or similar).
Yeah. Since all edits are manual anyway (and mirrors are usually low on Google), the "bot"'s confusion can't hurt.
Update on an old thread: The so-called "bot" has an account (in case the people at BRFA think I should use a flagged account). I currently run the script to find copyvios in my spare time; I've found three. The galaxy is at peace. --Chris is me
Cool :)
Found any instances of other sites plagiarising us without giving credit? I found one once where a woman was even using photos I'd taken on her site...I wasn't sure whether to be flattered, offended, or just...confused.
Steve
I guess that depends on the context in which the photos were being used.
--Ryan
Heh, the Register used one of my photos in their "Wikipedia Chicken attacked" article (mostly because it happens to be the first image in the article, I guess). That's cool though; I released it to the public domain, so I can hardly complain about people using it, and in that context it would have been fair use either way, I guess.
It's terrible. Many are obvious, but some are so subtle that I only found them through randomly Googling a phrase. (Also, thanks a bundle for the Perl script. I got halfway there with AppleScript, but now I don't have to worry!)
Chris
I don't know anything about Perl, but I get this foreign error trying to run the script in my command line:
Can't locate HTML/FormatText.pm in @INC (@INC contains: /sw/lib/perl5 /sw/lib/perl5/darwin /System/Library/Perl/5.8.6/darwin-thread-multi-2level /System/Library/Perl/5.8.6 /Library/Perl/5.8.6/darwin-thread-multi-2level /Library/Perl/5.8.6 /Library/Perl /Network/Library/Perl/5.8.6/darwin-thread-multi-2level /Network/Library/Perl/5.8.6 /Network/Library/Perl /System/Library/Perl/Extras/5.8.6/darwin-thread-multi-2level /System/Library/Perl/Extras/5.8.6 /Library/Perl/5.8.1/darwin-thread-multi-2level /Library/Perl/5.8.1 .) at copyviofinder.pl line 37. BEGIN failed--compilation aborted at copyviofinder.pl line 37.
Hope it's of any use, and thanks a lot! Chris
That means you don't have HTML::FormatText installed, which is a prerequisite module. You also need HTML::TreeBuilder and WWW::Mechanize. For information on installing Perl modules, see http://www.cpan.org/misc/cpan-faq.html#How_install_Perl_modules .
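For instance, from a shell prompt (assuming CPAN is already configured on that machine), the missing modules can usually be pulled in with:

perl -MCPAN -e 'install "HTML::FormatText"'
perl -MCPAN -e 'install "HTML::TreeBuilder"'
perl -MCPAN -e 'install "WWW::Mechanize"'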
On 11/24/06, Earle Martin wikipedia@downlode.org wrote:
Whether the copyvio is an inward or outward bound one in each case is sadly beyond the scope of my programming skills, so I leave that to you.
I don't think this is a programming problem; it's a conceptual problem.
A good copyvio bot -- one which doesn't waste one's time with false positives or outward copyvios -- would be one which monitors NEW additions and does not try to parse previously existing material. If someone says, "This is new, original text" but it gets Google hits, it is almost certainly copy-and-pasted (whether that makes it officially a copyvio still needs to be decided, but it is a vastly simpler problem than the previous one).
Trying to go through the entire database by finding random pages and taking random lines seems extremely hit-and-miss to me, and if you have to worry about mirrors and false positives then I can't see how that would possibly be productive. The odds of finding a copyvio are going to be quite low, and the amount of time needed to sort through them is going to be quite high. Monitoring RC for copyvio seems much simpler by comparison -- if finding previously-existing copyvios is going to be an impossible effort to automate successfully (which I think it is), preventing new copyvios would be comparatively easier.
(I started, ages ago, to work on a program which could do things like this, but got bogged down and lacked time. Sigh...)
Just my two cents... FF
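As a rough sketch of that approach - hypothetical code, not Wherebot or any other existing tool - compare the sentences of the new revision against the previous one and only send the freshly added ones to the search engine:

use strict;
use warnings;

# Return sentences that appear in the new revision but not the old one -
# the only text worth checking against Google.
sub newly_added_sentences {
    my ($old_text, $new_text) = @_;
    my %seen = map { $_ => 1 } split_sentences($old_text);
    return grep { !$seen{$_} } split_sentences($new_text);
}

# Crude sentence splitter; good enough for picking search phrases.
sub split_sentences {
    my ($text) = @_;
    my @parts = split /(?<=[.?!])\s+/, $text;
    s/^\s+//, s/\s+$// for @parts;
    return grep { length($_) > 40 } @parts;
}

# my @suspects = newly_added_sentences($old_revision_text, $new_revision_text);

Run against each edit from recentchanges, this avoids re-checking text that has already survived scrutiny.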
On 12/21/06, Fastfission fastfission@gmail.com wrote:
I don't think this is a programming problem; it's a conceptual problem.
A good copyvio bot -- one which doesn't waste one's time with false positives or outward copyvios -- would be one which monitors NEW additions and does not try to parse previously existing material. If someone says, "This is new, original text" but it gets Google hits, it is almost certainly copy-and-pasted (whether that makes it officially a copyvio still needs to be decided, but it is a vastly simpler problem than the previous one).
This is already being done
Trying to go through the entire database by finding random pages and taking random lines seems extremely hit-and-miss to me, and if you have to worry about mirrors and false positives then I can't see how that would possibly be productive. The odds of finding a copyvio are going to be quite low, and the amount of time needed to sort through them is going to be quite high.
Daniel Brandt managed it.
Daniel Brandt is an obsessive with way too much time on his hands, it seems. Not a practical solution to the overall problem.
-Matt
C'mon, how not? People with too much time have the time to improve Wikipedia!
I fear that level of obsession does not scale.
Or at least, I hope it does not.
-Matt
On 12/21/06, Matthew Brown morven@gmail.com wrote:
Daniel Brandt is an obsessive with way too much time on his hands, it seems. Not a practical solution to the overall problem.
Yes and no. Certain searches of existing content would be useful, the most obvious being running a copy of the database against a copy of Britannica.
And other databases of copyrighted texts, such as InfoTrac (http://www.gale.com/onefile/) or similar, and things like Google Book Search.
-- Neil
Just a thought: the en: Wikipedia gets about 3 edits a second. I wonder if it would be possible for us to use special pleading through the Foundation to get a dedicated search pipe into Google that would allow us to do, say, 30 searches a second, 24 hours a day (which would only be a tiny, tiny fraction of their overall capacity), in recognition of the _very_ substantial benefit in advertising revenue they must surely currently be receiving as a side effect of having Wikipedia's content online to draw in search queries.
(Think about it: even if only 20% of Wikimedia's 4000 or so page loads a second come from Google users who are expecting something like Wikipedia content, and Google only make $0.25 CPM on serving page ads on searches for those pages, that comes to an income stream of $0.20 per _second_ from Wikipedia searches, or a total of about $8M a year...)
If so, we could integrate the copyright violation bot into the toolserver, or into the MW server cluster itself.
-- Neil
Go ahead: Write the software, make it good, make it scale, make it robust so that you don't have to constantly twiddle with it to keep it working.
I have no doubt that Google's rate limit can be worked out. I promise you that good work done towards these ends will not be work wasted. Make sure that it's sufficiently modular that we'll be able to use it to generate queries against other text sources.
The logic for software to do this well is not trivial but certainly not impossible. Working out the right access with Google is also not impossible. Someone just needs to step up and do it.
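One possible reading of "sufficiently modular", sketched under that assumption rather than as an agreed design: keep the checking loop ignorant of where the comparison text comes from, so a Google backend and an offline-corpus backend are interchangeable.

use strict;
use warnings;

package SearchBackend::Google;
sub new  { my ($class, %opt) = @_; return bless {%opt}, $class }
sub find {
    # Would wrap whatever Google access is available (scraping or an API key)
    # and return URLs containing the phrase verbatim; left unimplemented here.
    my ($self, $phrase) = @_;
    die "Google backend not implemented in this sketch\n";
}

package SearchBackend::OfflineCorpus;
sub new  { my ($class, $file) = @_; return bless { file => $file }, $class }
sub find {
    # Report "file:line" for every line of the corpus containing the phrase.
    my ($self, $phrase) = @_;
    open my $fh, '<', $self->{file} or return;
    my @hits;
    while (<$fh>) { push @hits, "$self->{file}:$." if index($_, $phrase) >= 0 }
    return @hits;
}

package main;
# The checking loop only ever sees find():
# for my $backend (@backends) {
#     my @hits = $backend->find($phrase);
#     report($phrase, @hits) if @hits;
# }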
On 12/22/06, Gregory Maxwell gmaxwell@gmail.com wrote:
Go ahead: Write the software, make it good, make it scale, make it robust so that you don't have to constantly twiddle with it to keep it working.
http://en.wikipedia.org/wiki/User:Wherebot
I have no doubt that Google's rate limit can be worked out. I promise you that good work done towards these ends will not be work wasted. Make sure that it's sufficiently modular that we'll be able to use it to generate queries against other text sources.
Other text sources would, for the most part, be best run offline against database dumps, since most of them would involve runs against data that cannot be freely accessed on the web.
Do a search in Google; invariably the first reference you are given is Wikipedia. I don't have a clue what you are talking about technically, but I believe any agreement between Wikipedia and Google to improve service would be mutually beneficial.
Marc
From: "Gregory Maxwell" gmaxwell@gmail.com Reply-To: English Wikipedia wikien-l@Wikipedia.org Date: Thu, 21 Dec 2006 20:23:32 -0500 To: "English Wikipedia" wikien-l@wikipedia.org Subject: Re: [WikiEN-l] Copyright Violation Bot
On 12/21/06, Neil Harris usenet@tonal.clara.co.uk wrote:
Just a thought: the en: Wikipedia gets about 3 edits a second. I wonder if it would be possible for us to use special pleading through the Foundation to get a dedicated search pipe into Google that would allow us to do, say, 30 searches a second 24 hours a day, (which would only be a tiny, tiny fraction of their overall capacity), in recognition of the _very_ substantial benefit in advertising revenue they must surely currently be receiving as a side effect of having Wikipedia's content online to draw in search queries.
(Think about it: even if only 20% of Wikimedia's 4000 or so page loads a second come from Google users who are expecting something like Wikipedia content, and Google only make $0.25 CPM on serving page ads on searches for those pages, that comes to an income stream of $0.20 per _second_ from Wikipedia searches, or a total of about $8M a year...)
If so, we could integrate the copyright violation bot into the toolserver, or into the MW server cluster itself.
Go ahead: Write the software, make it good, make it scale, make it robust so that you don't have to constantly twiddle with it to keep it working.
I have no doubt that Google's ratelimit can be worked out. I promise you that good work done towards these ends will not be work wasted. Make sure that it's sufficently modular that we'll be able to use it to generate queries against other texts sources.
The logic for software to do this well is not trivial but certainly not impossible. Working out the right access with Google is also not impossible. Someone just needs to step up an do it. _______________________________________________ WikiEN-l mailing list WikiEN-l@Wikipedia.org To unsubscribe from this mailing list, visit: http://mail.wikipedia.org/mailman/listinfo/wikien-l
I wonder how practical it would be to find some way of running a comparison search of a Wikipedia dump against a Britannica/Encarta/etc. CD? It could be done offline, and at leisure, for catching out past offenders - I suspect the worst cases are those we don't catch, which sit ignored and unrewritten for months. I know of at least one case where we were getting copy-pasted articles from the OxDNB...
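One way such an offline comparison could work - an illustration of the idea, not an existing tool - is to index overlapping word shingles from the reference text and flag any dump paragraph that shares one:

use strict;
use warnings;

my $SHINGLE = 8;    # compare on runs of 8 consecutive words

# Break a text into overlapping word shingles, normalised to lower case.
sub shingles {
    my ($text) = @_;
    my @w = grep { length } split /\W+/, lc $text;
    return map { join ' ', @w[ $_ .. $_ + $SHINGLE - 1 ] } 0 .. $#w - $SHINGLE + 1;
}

# Index the reference text once (e.g. a plain-text rip of another encyclopedia)...
# my %reference = map { $_ => 1 } shingles($reference_text);

# ...then, for every paragraph in the Wikipedia dump:
# my @shared = grep { $reference{$_} } shingles($paragraph);
# warn "possible overlap in $title\n" if @shared;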
Most of the copyvios aren't from other encyclopedias.
Also, I get a copyvio 1 in 10 times... that is way, way too many. I'd have hundreds of edits if I ran this for hours a day.
People don't realize how serious a problem this is... I've been noticing this for months, and I want to give Earle Martin a million barnstars for writing this (it inspired me to start learning Perl so I can write an automated, newpages based one... hint hint)
--Chris
On 24/12/06, Chris Picone ccool2ax@gmail.com wrote:
People don't realize how serious a problem this is... I've been noticing this for months, and I want to give Earle Martin a million barnstars for writing this
Aw shucks!
(it inspired me to start learning Perl so I can write an automated, newpages based one... hint hint)
Flattery will get you nowhere... on the other hand, a programming challenge often will :) I may have a go at this one come the new year.
Merry Christmas or local/cultural equivalent!
Earle
On 12/21/06, geni geniice@gmail.com wrote:
Daniel Brandt managed it.
Did he do it by using random pages? It strikes me that it would be something most easily done if you downloaded a copy of the database and then ran it off of that systematically (you could filter out short articles while you are at it).
FF
He searched biographies. One of the challenges is building up a really complete set of Wikipedia mirrors.
On 21/12/06, Fastfission fastfission@gmail.com wrote:
The odds of finding a copyvio are going to be quite low.
Did you try actually running the program? I guess not.
Let's have another demonstration. I just ran the program a few times. On the fifth try it got me this:
http://en.wikipedia.org/wiki/Hilal-i-Jurat
which is copied wholesale from this copyrighted text:
http://faculty.winthrop.edu/haynese/medals/Pakistan/pakistan.htm
On the very next try, I got:
http://en.wikipedia.org/wiki/A._J._Seymour
which was either copied from, or to (either way, without attribution) this website:
http://www.triste-le-roi.blogspot.com/ajs_main.html
The revision history indicates the person responsible was the maintainer of the website. Six or seven more program runs later, I got:
http://en.wikipedia.org/wiki/Abd%C3%BClhak_H%C3%A2mid
which contains material copied from:
http://www.osmanli700.gen.tr/english/individuals/a19.html
My conclusion? First, the basic idea of the program is a sound one. Second, there are thousands more copyvios on the English Wikipedia than most people would ever imagine.
By the way, my program has a second use - you can also use it on a specific page if you're suspicious that it contains copyrighted text.
Your real killer is that Google limits the number of searches per IP. The other issue is that the first sentence tends to be the one most likely to be altered. Not saying it's impossible, just rather hard to do.
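Two cheap mitigations for those problems, sketched as assumptions rather than anything in the current script: search on a sentence from the middle of the paragraph instead of the first one, and space the queries out:

use strict;
use warnings;

use constant SECONDS_BETWEEN_QUERIES => 20;   # stay well under any per-IP limit

# The opening sentence is the one most often reworded, so search on a
# sentence from the middle of the paragraph instead.
sub pick_phrase {
    my ($para) = @_;
    my @sentences = grep { length($_) > 40 } split /(?<=[.?!])\s+/, $para;
    return unless @sentences;
    return $sentences[ int( @sentences / 2 ) ];
}

# In the main loop, something like:
# my $phrase = pick_phrase($para) or next;
# check_with_google($phrase);              # however the lookup is done
# sleep SECONDS_BETWEEN_QUERIES;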
Yeah... first, I have a dynamic IP so Google's not an issue (I use the https: server to connect to Wikipedia). Also, er... I've encountered one small developmental problem: there aren't any browsers I know of that are very compatible with AppleScript (except maybe Safari). I'll keep trying.