Copyright infringement - The real elephant in the room - Wikimedia-l

Copyright infringement - The real elephant in the room


James Heilman

13 Nov 2013

7:40 a.m.

The Wikimedia Foundation needs to wake up and deal with the "real tech elephant in the room". Our primary issue is not a lack of FLOW, a lack of a visual editor, or a lack of a rapidly expanding education program. Our biggest issue is copyright infringement.

We have had the Indian program, we have had issues with the Education program, and I have today come across a user who has made nearly 20,000 edits to 1,742 articles since 2006 which appear to be nearly all copied and pasted from the sources he has used. https://en.wikipedia.org/wiki/User_talk:DrMicro#Copyright_infringement This has seriously shaken my faith in Wikipedia.

This is especially devastating as there is a tech solution that would have prevented it. The efforts are being worked on by volunteers here https://en.wikipedia.org/wiki/Wikipedia:Turnitin and have been since at least March of 2012. We NEED all tech resources at the foundation thrown at this project. Other less important projects like FLOW and the visual editor need to be put on hold to develop this tool.

-- James Heilman MD, CCFP-EM, Wikipedian
The Wikipedia Open Textbook of Medicine www.opentextbookofmedicine.com

Gerard Meijssen

13 Nov

8:34 a.m.

Hoi, Seriously we should never ever be ruled by panic. What you see is bad, no doubt, but the notion that we should dump everything because of the latest issue to come along is way overboard.

- by stopping the flow on projects like Visual Editor you break dependencies for the work of many developers
- what you have noticed is for only one Wikipedia, not all of them
- we do need more mature discussion software; what we have is horrible
- such dramatics only have you go away and upset others; it does not solve things
- the dramatics distract me from your message
- my hobby horse needs more attention too and I think my argument is better ...

Anyway, it would be nice if someone looked at the tool with an eye to making it happen and making it scale. When it doesn't, it becomes a less attractive option to pursue. Thanks, GerardM

On 13 November 2013 08:40, James Heilman <jmh649(a)gmail.com> wrote:

...

Matthew Flaschen

8:37 a.m.

On 11/13/2013 02:40 AM, James Heilman wrote:

...

I don't really agree with that. It is a serious issue, but I would put NPOV (in the face of active threats such as companies paying for publicity on Wikipedia) and growing the editor community higher. We also have solutions to address it (not perfectly, true), both preventing the problem and dealing with it after the fact:

* MadmanBot (https://en.wikipedia.org/wiki/User:MadmanBot) (mentioned at Wikipedia:TurnItIn, and a major technical tool against copyright infringement).
* Clear policies against copyright infringement
* Dealing with copyright violations (https://en.wikipedia.org/wiki/Wikipedia:Text_Copyright_Violations_101)
* Finally, the DMCA ensures the foundation is not liable as long as they promptly respond to notifications (which of course we want them to anyway).

...

We have had the Indian program, we have had issues with the Education program, and I have today come across a user who has made nearly 20,000 edits to 1,742 articles since 2006 which appear to be nearly all copied and pasted from the sources he has used. https://en.wikipedia.org/wiki/User_talk:DrMicro#Copyright_infringement This has seriously shaken my faith in Wikipedia.

That is indeed disturbing, and I'm glad you found it.

...

This is especially devastating as there is a tech solution that would have prevented it. The efforts are being worked on by volunteers here https://en.wikipedia.org/wiki/Wikipedia:Turnitin and have been since at least March of 2012. We NEED all tech resources at the foundation thrown at this project. Other less important projects like FLOW and the visual editor need to be put on hold to develop this tool.

I don't agree that all tech resources should be used for this. However, there may be room for enhancing MadmanBot (e.g. as a GSOC or OPW project). A significant problem with TurnItIn is that it is proprietary, and cannot be customized by anyone in the movement. The fact that it is proprietary also means it can never be part of the main infrastructure, nor run on Wikimedia Labs. Matt Flaschen

Philippe Beaudette

10:16 a.m.

On Wed, Nov 13, 2013 at 2:37 AM, Matthew Flaschen <matthew.flaschen(a)gatech.edu> wrote:

...

A significant problem with TurnItIn is that it is proprietary, and cannot be customized by anyone in the movement. The fact that it is proprietary also means it can never be part of the main infrastructure, nor run on Wikimedia Labs.

Another significant issue is the "False Positive" factor that is created by our overwhelming popularity. Frankly, we're mirrored all over the place. And tools like Turnitin find the mirrors too. It's not an easy problem to solve. I was on the team that looked at this a couple of years back - it's just not simple, and there are complex challenges.

Philippe Beaudette
Director, Community Advocacy
Wikimedia Foundation, Inc.
T: 1-415-839-6885 x6643 | philippe(a)wikimedia.org | @Philippewiki <https://twitter.com/Philippewiki>

Matthew Flaschen

10:23 a.m.

On 11/13/2013 05:16 AM, Philippe Beaudette wrote:

...

On Wed, Nov 13, 2013 at 2:37 AM, Matthew Flaschen <matthew.flaschen(a)gatech.edu> wrote:

Gerard Meijssen

10:44 a.m.

Hoi, I know several authors who publish and also use their original text on Wikipedia. This is another source of false positives, because they hold the copyright to the original source. To recognise this you have to be even more sophisticated. The point I want to make is that having a tool that is KNOWN to be deficient in specific ways can still be a huge advantage over not having a tool at all. So PLEASE let's not make perfection the enemy of the good. Thanks, GerardM

On 13 November 2013 11:23, Matthew Flaschen <matthew.flaschen(a)gatech.edu> wrote:

...

On 11/13/2013 05:16 AM, Philippe Beaudette wrote:

On Wed, Nov 13, 2013 at 2:37 AM, Matthew Flaschen <matthew.flaschen(a)gatech.edu> wrote:

A significant problem with TurnItIn is that it is proprietary, and cannot be customized by anyone in the movement. The fact that it is proprietary also means it can never be part of the main infrastructure, nor run on Wikimedia Labs.

Yes, an intelligent solution would take into account when the mirror was first indexed (or ideally first published), and when the Wikipedia article was edited, to reduce false positives requiring manual intervention. Matt Flaschen

_______________________________________________
Wikimedia-l mailing list
Wikimedia-l(a)lists.wikimedia.org
Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l, <mailto:wikimedia-l-request@lists.wikimedia.org?subject=unsubscribe>
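The date comparison Matt Flaschen suggests above can be illustrated with a minimal Python sketch. Everything named here (`Match`, `first_seen`, `likely_mirror`, the example URLs and dates) is a hypothetical stand-in, not part of MadmanBot, Turnitin, or any existing tool; it only assumes that a "first indexed" date can be obtained for each matching page.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime


@dataclass
class Match:
    """A web page whose text overlaps with a Wikipedia edit (hypothetical record)."""
    url: str
    first_seen: datetime  # when the page was first indexed or published


def likely_mirror(match: Match, edit_time: datetime) -> bool:
    """True if the matching page appeared only after the edit was saved.

    Such a page cannot have been the editor's source, so it is more likely
    a mirror copying Wikipedia than the other way round.
    """
    return match.first_seen > edit_time


def needs_review(matches: list[Match], edit_time: datetime) -> list[Match]:
    """Keep only matches that predate the edit and so warrant a human look."""
    return [m for m in matches if not likely_mirror(m, edit_time)]


# Made-up example data: one older page and one page that appeared later.
edit_time = datetime(2010, 5, 1)
matches = [
    Match("https://example.org/original-paper", datetime(2008, 3, 14)),
    Match("https://mirror.example.net/wiki/Article", datetime(2012, 9, 2)),
]
suspects = needs_review(matches, edit_time)
print([m.url for m in suspects])
```

Only the page that existed before the edit stays in the review list. The sketch says nothing about how reliable "first indexed" dates are in practice, which is the hard part Philippe Beaudette points to.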

Marco Chiesa

11:26 a.m.

On Wed, Nov 13, 2013 at 11:44 AM, Gerard Meijssen <gerard.meijssen(a)gmail.com> wrote:

...

Actually, we consider these as copyvios, we delete the text straight away, and we tell the editor "if you're the author, write to OTRS". Of course, if the text is already available somewhere else under a compatible free license, we don't need this. As long as you can't be sure that User:MrX is actually the physical person MrX, we need to protect the author's rights. Marco

Chris McKenna

11:36 a.m.

On Wed, 13 Nov 2013, Marco Chiesa wrote:

...

On Wed, Nov 13, 2013 at 11:44 AM, Gerard Meijssen <gerard.meijssen(a)gmail.com> wrote:

But an automated tool can not know whether OTRS verification has happened or not. ---- Chris McKenna cmckenna(a)sucs.org www.sucs.org/~cmckenna The essential things in life are seen not with the eyes, but with the heart Antoine de Saint Exupery

Marco Chiesa

11:40 a.m.

On Wed, Nov 13, 2013 at 12:36 PM, Chris McKenna <cmckenna(a)sucs.org> wrote:

...

But an automated tool can not know whether OTRS verification has happened or not.

We put something like {{OTRS verified}} in the article's talk page, something saying: Part of the text comes from website X, ticket 1234567890. And if the author wants to use his work for many articles, we tell him/her to put the template in all his/her articles' talk page. Marco
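Marco's answer implies that an automated tool could look for the template on the talk page before flagging a match. A minimal sketch of that check, assuming the talk page wikitext is already in hand; the template name follows his "something like {{OTRS verified}}", and the parameter names in the example are invented for illustration.

```python
import re

# Hypothetical template name, taken from "something like {{OTRS verified}}".
# Real wikis may use a different name and different parameters.
OTRS_TEMPLATE = re.compile(
    r"\{\{\s*OTRS[ _]verified\s*(?:\|[^{}]*)?\}\}", re.IGNORECASE
)


def has_otrs_verification(talk_page_wikitext: str) -> bool:
    """Return True if the talk page carries the (assumed) OTRS template."""
    return OTRS_TEMPLATE.search(talk_page_wikitext) is not None


# The parameter names "source" and "ticket" are made up for this example.
talk = "{{OTRS verified|source=website X|ticket=1234567890}}\nSome discussion."
print(has_otrs_verification(talk))            # True
print(has_otrs_verification("No template."))  # False
```

A tool using such a check would still depend on editors placing the template consistently, which is the gap Chris McKenna raises.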

Chris McKenna

11:39 a.m.

On Wed, 13 Nov 2013, Gerard Meijssen wrote:

...

The point I want to make is that having a tool that is KNOWN to be deficient in specific ways can still be a huge advantage over not having a tool at all. So PLEASE lets not make perfection the enemy of the good.

The problem isn't that we're waiting for perfection. We're waiting for the proportion of false positives and false negatives to fall to a level where they don't overwhelm the true positives. ---- Chris McKenna cmckenna(a)sucs.org www.sucs.org/~cmckenna The essential things in life are seen not with the eyes, but with the heart Antoine de Saint Exupery

Marco Chiesa

11:46 a.m.

On Wed, Nov 13, 2013 at 12:39 PM, Chris McKenna <cmckenna(a)sucs.org> wrote:

...

The problem isn't that we're waiting for perfection. We're waiting for the proportion of false positives and false negatives to fall to a level where they don't overwhelm the true positives.

rupert THURNER

14 Nov

1:36 p.m.

There is such a case in http://en.m.wikipedia.org/wiki/Education_in_Cameroon, the reference is on the talk page. Would you be so kind as to mark or refer to it correctly? rupert

On 13.11.2013 12:46, "Marco Chiesa" <chiesa.marco(a)gmail.com> wrote:

...

On Wed, Nov 13, 2013 at 12:39 PM, Chris McKenna <cmckenna(a)sucs.org> wrote:

The problem isn't that we're waiting for perfection. We're waiting for the proportion of false positives and false negatives to fall to a level where they don't overwhelm the true positives.

To avoid false positives from mirrors, the best option is to compare a text as soon as it is saved. Also, you exclude certain websites from the comparison because you know they're the mirrors, you exclude rollbacks, ... Then, it is better to have a human checking that it is really a copyvio (it could well be a public domain text, or another Wikipedia article). Marco
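The filtering Marco describes (skip rollbacks, drop known mirrors, then hand what remains to a human) can be sketched in a few lines of Python. The `Edit` record, the `KNOWN_MIRRORS` set, and the example URLs are assumptions made for this illustration, not the data model of any existing bot.

```python
from __future__ import annotations

from dataclasses import dataclass
from urllib.parse import urlparse

# Hypothetical exclusion list of sites known to mirror Wikipedia.
KNOWN_MIRRORS = {"mirror.example.net"}


@dataclass
class Edit:
    """A freshly saved edit plus the web pages its text matched (hypothetical)."""
    page: str
    is_rollback: bool
    matching_urls: list[str]


def urls_for_human_review(edit: Edit) -> list[str]:
    """Return the URLs a human should check; an empty list means skip the edit."""
    if edit.is_rollback:
        # A rollback restores older text, so any match is not a new insertion.
        return []
    return [
        url for url in edit.matching_urls
        if urlparse(url).hostname not in KNOWN_MIRRORS
    ]


edit = Edit(
    page="Example article",
    is_rollback=False,
    matching_urls=[
        "https://mirror.example.net/wiki/Example_article",
        "https://publisher.example.org/chapter-3",
    ],
)
print(urls_for_human_review(edit))
```

The function deliberately stops at producing a review list: as Marco notes, a match may still be public domain text or another Wikipedia article, so the final call stays with a person.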

Florence Devouard

15 Nov

12:44 p.m.

Hmmmm Rupert, The case you mention is unrelated to any copyright infringement (the book is explicitly published under cc by sa, so there is no copyvio). Its mention here is like hair falling in soup. Now, I think there is a developing personal feud between you and Iolenda. It sincerely saddens me to see two people I appreciate come to such a situation. Would you both consider talking to each other on Skype or something like this? Alternatively, find someone neutral and nice to help fix things so that you can come to a mutual understanding? I understand that you both see things differently, but ultimately, you both are here to make things move on. Flo

On 11/14/13 2:36 PM, rupert THURNER wrote:

...

On Wed, Nov 13, 2013 at 12:39 PM, Chris McKenna <cmckenna(a)sucs.org> wrote:

The problem isn't that we're waiting for perfection. We're waiting for the proportion of false positives and false negatives to fall to a level where they don't overwhelm the true positives.

rupert THURNER

7:36 p.m.

Salut Florence, I obviously need to improve my English :) Marco suggested human checking to avoid false positives, and some annotation that it happened. In my eyes the cited case is a verbatim copy of a text under a compatible license, which could be used as an example to demonstrate what he meant. I have not seen such a thing up to now and would not be 100% sure how to do it correctly. So I asked. Rupert

On 15.11.2013 13:44, "Florence Devouard" <anthere9(a)yahoo.com> wrote:

...

There is such a case in http://en.m.wikipedia.org/wiki/Education_in_Cameroon, reference is on the talk page. would you be so kind to mark or refer to it correctly? rupert

On 13.11.2013 12:46, "Marco Chiesa" <chiesa.marco(a)gmail.com> wrote:

On Wed, Nov 13, 2013 at 12:39 PM, Chris McKenna <cmckenna(a)sucs.org> wrote:

The problem isn't that we're waiting for perfection. We're waiting for the proportion of false positives and false negatives to fall to a level where they don't overwhelm the true positives.

To avoid false positives from mirrors, the best option is to compare a text as soon as it is saved. Also, you exclude certain websites from the comparison because you know they're the mirrors, you exclude rollbacks, ... Then, it is better to have a human checking that it is really a copyvio (it could well be a public domain text, or another Wikipedia article). Marco

Anthony Cole

16 Nov

2:04 p.m.

The problem of false positives from mirrors doesn't exist if we scan edits as they are made.

Maggie says here<https://en.wikipedia.org/wiki/Wikipedia:Administrators%27_noticeboa… copyright bots populate WP:SCV <https://en.wikipedia.org/wiki/Wikipedia:SCV> So a similarly-configured bot could scan recent changes and tag suspected copyvios in watchlists and page histories like suspected vandalism is currently tagged. Ideally the edit summary would contain a url to the suspected source.

Maggie points out that those copyright bots were blocked for a while (from scanning Google I presume) due to a negative impact on Google, and this problem was solved by someone writing a cheque.

False positives won't be a problem, unless they're more than, say, 50%. If a recent changes patroller can't confirm the copyvio, they can let it go. But an editor whose "contributions" list is peppered with such warnings would stand out like a sore thumb.

Anthony Cole <http://en.wikipedia.org/wiki/User_talk:Anthonyhcole>
Memberships secretary
Wiki Project Med Foundation <http://meta.wikimedia.org/wiki/Wiki_Project_Med>

On Sat, Nov 16, 2013 at 3:36 AM, rupert THURNER <rupert.thurner(a)gmail.com> wrote:

...

help fix things so that you can come to a mutual understanding ? I understand that you both see things differently, but ultimately, you both are here to make things move on. Flo

On 11/14/13 2:36 PM, rupert THURNER wrote:

> There is such a case in http://en.m.wikipedia.org/wiki/Education_in_Cameroon, reference is on the talk page. would you be so kind to mark or refer to it correctly?
>
> rupert
>
> On 13.11.2013 12:46, "Marco Chiesa" <chiesa.marco(a)gmail.com> wrote:
>
>> On Wed, Nov 13, 2013 at 12:39 PM, Chris McKenna <cmckenna(a)sucs.org> wrote:
>>
>>> The problem isn't that we're waiting for perfection. We're waiting for the proportion of false positives and false negatives to fall to a level where don't overwhelm the true positives.
>>>
>>> To avoid false positives from mirrors, the best option is to compare

Matthew Flaschen

19 Nov

1:07 a.m.

On 11/16/2013 09:04 AM, Anthony Cole wrote:

...

The problem of false positives from mirrors doesn't exist if we scan edits as they are made.

Agreed. However, that example is a legal, attributed (at least on the talk page) copy from a third-party freely licensed text, not a false positive copy from a Wikipedia mirror.

...

Maggie says here<https://en.wikipedia.org/wiki/Wikipedia:Administrators%27_noticeboa… copyright bots populate WP:SCV <https://en.wikipedia.org/wiki/Wikipedia:SCV> So a similarly-configured bot could scan recent changes and tag suspected copyvios in watchlists and page histories like suspected vandalism is currently tagged.

The suspected vandalism checks that actually tag the edit (e.g. "Tag: possible vandalism") are based on AbuseFilter checks. These are relatively fast determinations that consider the text of the edit (e.g. regexes for strings of curse words, or meaningless repeating characters), and comparisons to the previous version (blanked the section, blanked the page).

As far as I know, regular AbuseFilter rules cannot hit a database or web search to check for copyright violations. An extension could in theory do this. But there would possibly be performance problems, since AbuseFilter runs on the actual server (not just some bot's computer) on every edit.

It is possible for a bot to scan every edit; it just can't use AbuseFilter tags. Matt Flaschen
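To make the contrast concrete, here is a rough Python sketch of the kind of fast, purely local checks described above: a regex for repeating characters and a comparison with the previous version to detect blanking. This is not AbuseFilter's rule language; the function name, the flag labels, and the threshold of ten repeated characters are all assumptions chosen for illustration.

```python
from __future__ import annotations

import re

# Assumed threshold: the same character ten or more times in a row.
REPEATED_CHARS = re.compile(r"(.)\1{9,}")


def quick_flags(old_text: str, new_text: str) -> list[str]:
    """Cheap checks that need only the old and new text of the page."""
    flags = []
    # Comparison to the previous version: the page had content and now has none.
    if old_text.strip() and not new_text.strip():
        flags.append("blanking")
    # Text-of-the-edit check: a long character run that was not there before.
    if REPEATED_CHARS.search(new_text) and not REPEATED_CHARS.search(old_text):
        flags.append("repeating characters")
    return flags


print(quick_flags("Some article text.", ""))
print(quick_flags("Text.", "Text. aaaaaaaaaaaaaaa"))
print(quick_flags("Text.", "Text. More text."))
```

Both checks run without leaving the process, which is why they are cheap enough to run on every save. A copyright check, by contrast, has to consult an external index or search engine, which is the performance concern raised above.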

Andrew Gray

1:12 p.m.

It could use abuse-filter tags, just not in an entirely standard way:

* Bot scans edit X
* Script flags it as a problem
* Bot makes edit X+1 to page (perhaps adding copyvio template?) which triggers an abusefilter rule for (if this bot and does such-and-such an edit) and tags it.

The offending edit itself won't be tagged, but the page history will and it can probably be spotted quite easily from there. A.

On 19 November 2013 01:07, Matthew Flaschen <mflaschen(a)wikimedia.org> wrote:

...

On 11/16/2013 09:04 AM, Anthony Cole wrote:

The problem of false positives from mirrors doesn't exist if we scan edits as they are made.

Agreed. However, that example is a legal, attributed (at least on the talk page) copy from a third-party freely licensed text, not a false positive copy from a Wikipedia mirror.

-- - Andrew Gray andrew.gray(a)dunelm.org.uk
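Andrew Gray's three-step workaround can be walked through with a small, self-contained Python simulation. Nothing here talks to MediaWiki: `Revision`, `save`, `filter_rule`, the bot name, and the notice template are hypothetical stand-ins, and the "copied text" detector is a placeholder passed in by the caller.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable

BOT_NAME = "CopyvioBot"           # hypothetical bot account
NOTICE = "{{copyvio suspected}}"  # hypothetical notice template


@dataclass
class Revision:
    user: str
    text: str
    tags: list[str] = field(default_factory=list)


def filter_rule(rev: Revision) -> list[str]:
    """Stand-in for a filter rule: tag edits where this bot adds its notice."""
    if rev.user == BOT_NAME and NOTICE in rev.text:
        return ["possible copyright violation"]
    return []


def save(history: list[Revision], user: str, text: str) -> Revision:
    """Save a revision; the filter rule runs on every save, as on the server."""
    rev = Revision(user, text)
    rev.tags = filter_rule(rev)
    history.append(rev)
    return rev


def bot_check(history: list[Revision], looks_copied: Callable[[str], bool]) -> None:
    """Bot scans edit X; if flagged, it makes edit X+1 adding the notice."""
    latest = history[-1]
    if latest.user != BOT_NAME and looks_copied(latest.text):
        save(history, BOT_NAME, NOTICE + "\n" + latest.text)


history: list[Revision] = []
save(history, "SomeEditor", "Text pasted from a website.")
bot_check(history, lambda text: "pasted" in text)  # placeholder detector
print([(rev.user, rev.tags) for rev in history])
```

As in the description, the offending revision itself carries no tag; the tag lands on the bot's follow-up edit, which is what makes the problem visible in the page history.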

Quim Gil

13 Nov

5:44 p.m.

On 11/13/2013 12:37 AM, Matthew Flaschen wrote:

...

However, there may be room for enhancing MadmanBot (e.g. as a GSOC or OPW project).

Any technical project able to identify small tasks and available mentors is welcome to join Wikimedia's Google Code-in team at https://www.mediawiki.org/wiki/Google_Code-In GCI will start next week and will last until the beginning of January. Hundreds of young students will scan our tasks and will eventually complete some of them. It is a program ideal for small projects, like the bots or gadgets used by editors. -- Quim Gil Technical Contributor Coordinator @ Wikimedia Foundation http://www.mediawiki.org/wiki/User:Qgil

Steven Walling

9:03 a.m.

On Tue, Nov 12, 2013 at 11:40 PM, James Heilman <jmh649(a)gmail.com> wrote:

...

Andrew Lih

14 Nov

3:47 p.m.

FYI, on the last Wikipedia Weekly podcast, we talked with Sage Ross about the plagiarism issue, and he walked through the study with some very interesting insights. Video here, and the discussion started at 11 minutes, 30 seconds into the podcast. https://www.youtube.com/watch?v=IOgYytn2JRk -Andrew

On Wed, Nov 13, 2013 at 4:03 AM, Steven Walling <steven.walling(a)gmail.com> wrote:

...

On Tue, Nov 12, 2013 at 11:40 PM, James Heilman <jmh649(a)gmail.com> wrote:

The Wikimedia Foundation needs to wake up and deal with the "real tech elephant in the room". Our primary issue is not a lack of FLOW, a lack of a visual editor, or a lack of a rapidly expanding education program. Our biggest issue is copyright infringement. We have had the Indian program, we have had issues with the Education program, and I have today come across a user who has made nearly 20,000 edits to 1,742 articles since 2006 which appear to be nearly all copied and pasted from the sources he has used. https://en.wikipedia.org/wiki/User_talk:DrMicro#Copyright_infringement This has seriously shaken my faith in Wikipedia. This is especially devastating as there is a tech solution that would have prevented it. The efforts are being worked on by volunteers here https://en.wikipedia.org/wiki/Wikipedia:Turnitin and have been since at least March of 2012. We NEED all tech resources at the foundation thrown at this project. Other less important projects like FLOW and the visual editor need to be put on hold to develop this tool.

Relevant info on the subject of copyvio is the recent plagiarism study by the Education Program team. They looked at different types of users (students, newbies, experienced editors, admins) and compared them. Results were published on Meta at https://meta.wikimedia.org/wiki/Research:Plagiarism_on_the_English_Wikipedi… also discussed in the last WMF Metrics & Activities meeting: https://meta.wikimedia.org/wiki/Metrics_and_activities_meetings/2013-11-07 AFAIK this is the best data we have about how often different kinds of editors close paraphrase or outright copy/paste.

Laura Hale

4:15 p.m.

On Thu, Nov 14, 2013 at 4:47 PM, Andrew Lih <andrew.lih(a)gmail.com> wrote:

...

I've done a study of student contributions on English Wikinews. We've found that only about 15% of student submissions have a copyright issue. The level of plagiarism and copyright problems is about the same for regular contributors, new contributors and student contributors on English Wikinews with that range of 10 to 15%. This is an issue we have to be on top of because nothing gets published on the project without being reviewed for this issue. Sincerely, Laura Hale -- twitter: purplepopple blog: ozziesport.com

Marco Chiesa

13 Nov

9:21 a.m.

On Wed, Nov 13, 2013 at 8:40 AM, James Heilman <jmh649(a)gmail.com> wrote:

...

Our biggest issue is copyright infringement. We have had the Indian program, we have had issues with the Education program, and I have today come across a user who has made nearly 20,000 edits to 1,742 articles since 2006 which appear to be nearly all copied and pasted from the sources he has used. https://en.wikipedia.org/wiki/User_talk:DrMicro#Copyright_infringement This has seriously shaken my faith in Wikipedia.

Back in 2007 we found a user on it.wp, a former sysop with more than 40,000 edits, who used to copy-paste from his sources, which were often outdated. He was banned, and the community made a great effort to clean up the articles he contributed to (and damn it was hard, because those articles had a long history after his edits). And in the following years we had other similar cases; you can find a selection here: https://it.wikipedia.org/wiki/Progetto:Cococo/Controlli_conclusi

There are bots that go and look whether a newly inserted block of text is already present somewhere else. They don't find everything (of course they won't find things copied from a printed book), but sooner or later serial copyviolers get caught, and the fall from hero to zero is sooo quick.

At the end of the day, I think copyvios have always been taken seriously, so that I don't remember big problems with that, while there have always been more problems with libel, privacy, and editor retention. Marco (Cruccone)

Lodewijk

9:53 a.m.

Marco: I agree, we also had issues on the Dutch Wikipedia - these have been around for ages, the English Wikipedia is just less aware of them. Often, copypasting in the same language is caught easily - between different languages it is much harder to catch and more persistent. There are many people, including experienced editors, who think translating from random sources is OK. It is no new problem, and chapters have indeed been working on getting this understanding of what free licenses really mean more widely accepted in the general audience. Not something that is easily measured of course. Technical solutions sound great, but only catch a small amount, and only inside the same language.

Steven: I understand this research was limited to the English Wikipedia (where most of the plagiarism will be in the same language). It would not strike me as out of the realm of realism to assume this might be very different for languages other than English. It also says little about the problem in general of course. For those who don't want to click on links to get information, it basically says (simplification alert) that they don't have any indication that the US & Canada education program makes the plagiarism problem on the English Wikipedia any worse than it already is.

Anyway: I think this problem is more prominent in non-English communities, and that technical solutions are not going to be the answer there. An educational answer is more likely to be successful, focusing on explaining to people how Wikipedia works and doesn't work, and what the do's and don'ts are. This doesn't have to be an education program like the one executed in the US; basically all outreach programs as executed by chapters, user groups, thematic organizations or groups of volunteers can contribute to this. This is already happening in most countries. In some countries (like Germany ;-) ) politicians are doing the work for us, explaining how evil plagiarism is and how it works by firing government ministers over it :) Best, Lodewijk

2013/11/13 Marco Chiesa <chiesa.marco(a)gmail.com>

...

On Wed, Nov 13, 2013 at 8:40 AM, James Heilman <jmh649(a)gmail.com> wrote:

since 2006 which appear to be nearly all copy and pasted from the sources he has used. https://en.wikipedia.org/wiki/User_talk:DrMicro#Copyright_infringement This has seriously shaken my faith in Wikipedia.

Nathan

6:39 p.m.

On Wed, Nov 13, 2013 at 4:53 AM, Lodewijk <lodewijk(a)effeietsanders.org> wrote:

...

Marco: I agree, we had also issues on the Dutch Wikipedia - these have been around for ages, the English Wikipedia is just less aware of them.

Not sure if you meant this how it sounds, but the English Wikipedia community is acutely aware of copyright problems and has undertaken many, many large and complicated cleanup tasks of the sort Marco described.

Michael Snow

6:48 p.m.

On 11/13/2013 10:39 AM, Nathan wrote:

...

On Wed, Nov 13, 2013 at 4:53 AM, Lodewijk <lodewijk(a)effeietsanders.org> wrote:

Marco: I agree, we had also issues on the Dutch Wikipedia - these have been around for ages, the English Wikipedia is just less aware of them.

I think he meant that the English Wikipedia community is less aware of the fact that we face these sorts of large-scale challenges in many other languages as well. In other words, the antecedent to "them" is "issues on the Dutch/Italian/etc. Wikipedia", rather than "copyright issues" generally. Most people participating in other languages are reasonably aware when major concerns surface from the English Wikipedia; people participating only in English often haven't a clue about the concerns being dealt with in other languages. --Michael Snow

Nathan

8:39 p.m.

On Wed, Nov 13, 2013 at 1:48 PM, Michael Snow <wikipedia(a)frontier.com> wrote:

...

On 11/13/2013 10:39 AM, Nathan wrote:

On Wed, Nov 13, 2013 at 4:53 AM, Lodewijk <lodewijk(a)effeietsanders.org> wrote:

Marco: I agree, we had also issues on the Dutch Wikipedia - these have been around for ages, the English Wikipedia is just less aware of them.

That makes sense, thanks for clearing that up for me.

Federico Leva (Nemo)

9:57 a.m.

Marco Chiesa, 13/11/2013 10:21:

...

There are bots that go and look whether a newly inserted block of text is already present somewhere else, [...]

Rectius: there *used* to be a bot (RevertBot, Lusumbot). The program <https://www.mediawiki.org/wiki/Manual:Pywikibot/copyright.py> was stopped when search engines changed their limits, and Lusum has been waiting for a while for the WMF's Yahoo! BOSS key, which is needed to run the bot. Nemo

Marc A. Pelletier

16 Nov

3:34 p.m.

On 11/13/2013 04:57 AM, Federico Leva (Nemo) wrote:

...

I haven't been "in charge" of that key in quite some time, but I think I still have the appropriate credentials to generate one for a copyright violation bot. I can look into it if you want. -- Marc

Federico Leva (Nemo)

6:29 p.m.

Marc A. Pelletier, 16/11/2013 16:34:

...

On 11/13/2013 04:57 AM, Federico Leva (Nemo) wrote:

I haven't been "in charge" of that key in quite some time, but I think I still have the appropriate credentials to generate one for a copyright violation bot. I can look into it if you want.

It would be awesome! Thank you. Nemo

Matthew Flaschen

20 Nov

5:05 a.m.

On 11/13/2013 04:57 AM, Federico Leva (Nemo) wrote:

...

Marco Chiesa, 13/11/2013 10:21:

There are bots that go and look whether a newly inserted block of text is already present somewhere else, [...]

https://en.wikipedia.org/wiki/User:MadmanBot is still running on English Wikipedia, which uses the same Yahoo APIs (http://www.uberbox.org/~marc/csb.pl). It might be possible to run it on Italian Wikipedia as well, even without generating a new key. The operator seems to be https://en.wikipedia.org/wiki/User:Madman Matt Flaschen

Federico Leva (Nemo)

7:12 a.m.

Matthew Flaschen, 20/11/2013 06:05:

...

On 11/13/2013 04:57 AM, Federico Leva (Nemo) wrote:

Marco Chiesa, 13/11/2013 10:21:

There are bots that go and look whether a newly inserted block of text is already present somewhere else, [...]

That bot links to code (Coren's) that asks for a key. So either the user is paying for one himself, or he got the WMF's one some time ago: in both cases, he can't give it to more people. Nemo

Fæ

13 Nov

11:48 a.m.

On 13 November 2013 07:40, James Heilman <jmh649(a)gmail.com> wrote: ...

...

Our biggest issue is copyright infringement.

... Thanks for raising this James. Yes, this is an issue but if you are gunning for elephants this month, I really don't think the copyright elephant is the biggest one in the herd.

As a practical example of the tools we already have in place, yesterday I was facilitating an edit-a-thon for women in science with King's College London and we had one of the example stubs we had created on the English Wikipedia up on a projector. Within literally *minutes* of creation it had been (correctly) flagged by a bot as a possible copyright violation, as some of the text had been cut & pasted from King's own website; one of the participants quickly re-wrote it using their own words. As the communications manager was sitting next to me at the time, no doubt she found this rather reassuring, even though in parallel she was asking about how best to "officially" release text. :-)

We have a more complex problem with how images uploaded to Wikimedia Commons can be flagged where they match images found elsewhere on the internet. This is something that may be done by a future bot, but we might need to partner with someone like Google Images or Tineye to make this truly effective. Having run my own experimental bots in this area, I would love to see this become a funded project.

PS with regard to OTRS verification, we could do with better standards for verification; at the moment volunteers like myself are left to use our own judgement about what checks to make. I tend to double check text or images being released with Google, just in case, as well as doing "whois" checks on email domains. These sorts of checks could become part of OTRS guidelines and would make the reliability of OTRS tickets a notch higher.

Cheers, Fae -- faewik(a)gmail.com http://j.mp/faewm

George Herbert

6:40 p.m.

New subject: Copyright infringement - The real elephant in the room

On Wed, Nov 13, 2013 at 3:48 AM, Fæ <faewik(a)gmail.com> wrote:

...

... PS with regard to OTRS verification, we could do with better standards for verification,

We are not attempting to perform a complete and unassailable verification; imagining that we can is folly. The point is, we need someone who credibly is the author or rightsholder, and with whom we have an audit trail of their claims and identity (email address we corresponded with, etc). When it comes down to it, we have no idea if an email is associated with the given person, that the alleged sender of a certified letter really is that person, or that the "John Doe" that came in to the office and showed valid government issued ID with a claim of copyright violation is the same John Doe who wrote the original material. There's no way for us to confirm in any reasonable manner. If there is an attempt at identity theft that is discovered, that audit trail is available to investigators with proper legal authorization etc. -- -george william herbert george.herbert(a)gmail.com

Samuel Klein

19 Nov 19 Nov

8:44 p.m.

New subject: Copyright infringement - The real elephant in the room

Aside @Fae: the tineye crew are curious & quite pro-freeculture, I bet they would be glad to help design a bot that uses their API to check image copyvios. On Nov 13, 2013 6:48 AM, "Fæ" <faewik(a)gmail.com> wrote:

...

On 13 November 2013 07:40, James Heilman <jmh649(a)gmail.com> wrote: ...

Our biggest issue is copyright infringement.

Federico Leva (Nemo)

10:12 p.m.

Samuel Klein, 19/11/2013 21:44:

...

Aside @Fae: the tineye crew are curious & quite pro-freeculture, I bet they would be glad to help design a bot that uses their API to check image copyvios.

How to make them include the whole Commons dataset into their own, to start with? Nemo

Fæ

20 Nov 20 Nov

10:10 a.m.

New subject: Copyright infringement - The real elephant in the room

On 19 November 2013 20:44, Samuel Klein <meta.sj(a)gmail.com> wrote:

...

Aside @Fae: the tineye crew are curious & quite pro-freeculture, I bet they would be glad to help design a bot that uses their API to check image copyvios.

The Cunctator

12:13 p.m.

New subject: Copyright infringement - The real elephant in the room

Yes, let's keep on pushing for policies that drive away editors! On Nov 20, 2013 2:10 AM, "Fæ" <faewik(a)gmail.com> wrote:

...

On 19 November 2013 20:44, Samuel Klein <meta.sj(a)gmail.com> wrote:

Aside @Fae: the tineye crew are curious & quite pro-freeculture, I bet they would be glad to help design a bot that uses their API to check image copyvios.

This is an area that spins off from my little experiments with better management of uploads to Commons from mobile devices. I would like to look at this again and perhaps get a funding proposal together (or partnership with Tineye if they are up for it). It is one of several creative back-burner volunteer projects that I hope to have time to dig into again next year. Fae _______________________________________________ Wikimedia-l mailing list Wikimedia-l(a)lists.wikimedia.org Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l, <mailto:wikimedia-l-request@lists.wikimedia.org?subject=unsubscribe>

Martijn Hoekstra

1:27 p.m.

New subject: Copyright infringement - The real elephant in the room

On Nov 20, 2013 1:13 PM, "The Cunctator" <cunctator(a)gmail.com> wrote:

...

Yes, let's keep on pushing for policies that drive away editors!

I'm not sure exactly what kind of policy you are getting at here. Could you elaborate a little?

...

On Nov 20, 2013 2:10 AM, "Fæ" <faewik(a)gmail.com> wrote:

On 19 November 2013 20:44, Samuel Klein <meta.sj(a)gmail.com> wrote:

Aside @Fae: the tineye crew are curious & quite pro-freeculture, I bet they would be glad to help design a bot that uses their API to check image copyvios.


Marc A. Pelletier

4:31 p.m.

On 11/20/2013 07:13 AM, The Cunctator wrote:

...

Yes, let's keep on pushing for policies that drive away editors!

Michael Snow

4:59 p.m.

On 11/20/2013 8:31 AM, Marc A. Pelletier wrote:

...

On 11/20/2013 07:13 AM, The Cunctator wrote:

Yes, let's keep on pushing for policies that drive away editors!

Not that I encourage us to be permissive about copyright infringement, but there are two potential aspects here. You've touched on the first, which is contributors who do the copying - if they are willing to change, that's fine, although I'm skeptical about the value of editors who "don't know any better" and certainly repeat offenders should be highly unwelcome. But the second aspect is the loss of tasks other editors may be able to participate in, if there's potential for overautomation of the review process and corresponding loss of human judgment (What needs to be removed and what could be fixed just by citing the source? How thorough a rewrite is necessary to avoid plagiarizing source text?). An essential part of collaboration is, after all, reviewing each other's work. From the terseness of the comment, it might be alluding to either aspect or both. --Michael Snow

Marc A. Pelletier

5:20 p.m.

On 11/20/2013 11:59 AM, Michael Snow wrote:

...

An essential part of collaboration is, after all, reviewing each other's work. From the terseness of the comment, it might be alluding to either aspect or both.

That's actually an interesting question that has been lurking beneath all the "editing is going down" nervousness. How much of that 'editing' was, in fact, busy work made immaterial by technical advances (bots, extensions, abusefilter)? The number of antivandalism edits a /human/ has to do in a day has most certainly come down a *lot* since c. 2006; this no doubt contributed to a large - now diminishing - fraction of total edits. It's not clear to me that the number of *productive* edits has been going down all that much (if at all) in the past several years; the proportion of edits that were tedious and repetitive clearly has. Are you arguing that there is *value* in volunteers spending time on work that could be automated? Except for artificially driving up edit counts, that is time (and effort) that would be better spent pretty much anywhere else! -- Marc

Richard Symonds

6:06 p.m.

New subject: Copyright infringement - The real elephant in the room

Not quite: I would argue that anti-vandalism work is a "gateway drug" to the rest of the project. Just a hunch, though. On Nov 20, 2013 5:21 PM, "Marc A. Pelletier" <marc(a)uberbox.org> wrote:

...

On 11/20/2013 11:59 AM, Michael Snow wrote:

An essential part of collaboration is, after all, reviewing each other's work. From the terseness of the comment, it might be alluding to either aspect or both.

Marc A. Pelletier

6:45 p.m.

On 11/20/2013 01:06 PM, Richard Symonds wrote:

...

Not quite: I would argue that anti-vandalism work is a "gateway drug" to the rest of the project. Just a hunch, though.

I'm pretty sure that typo correction fills pretty much the same niche, though. -- Marc

Benjamin Lees

21 Nov 21 Nov

7:23 p.m.

New subject: Copyright infringement - The real elephant in the room

On Wed, Nov 20, 2013 at 1:45 PM, Marc A. Pelletier <marc(a)uberbox.org> wrote:

...

On 11/20/2013 01:06 PM, Richard Symonds wrote:

Not quite: I would argue that anti-vandalism work is a "gateway drug" to the rest of the project. Just a hunch, though.

I'm pretty sure that typo correction fills pretty much the same niche, though. -- Marc

If we run out of typos and vandalism, we can always write bots to introduce more. :) I saw a great essay yesterday that ties in to both the social and the technical aspects of this discussion: https://en.wikipedia.org/w/index.php?title=User_talk:Protonk&oldid=5825…ress...

Matthew Flaschen

22 Nov 22 Nov

12:45 a.m.

New subject: Draft namespace (WAS: Copyright infringement - The real elephant in the room)

On 11/21/2013 02:23 PM, Benjamin Lees wrote:

...

I saw a great essay yesterday that ties in to both the social and the technical aspects of this discussion: https://en.wikipedia.org/w/index.php?title=User_talk:Protonk&oldid=5825…ress...

Re the "If a drafts namespace is created, special case the act of moving a draft to mainspace. Make a publish button or something. [...]" paragraph in that essay, that's the kind of thing we're looking at on the Growth team right now. See https://www.mediawiki.org/wiki/Wikipedia_article_creation Matt Flaschen

Michael Snow

20 Nov 20 Nov

6:13 p.m.

On 11/20/2013 9:20 AM, Marc A. Pelletier wrote:

...

A lot of work that gets automated is not necessarily difficult for humans, just time-consuming. But volunteer time is not a resource we get to allocate or control; the volunteers do. Simple tasks can help recruit or retain contributors--providing a way to ease people into participation, or a break to prevent burnout between tackling more challenging projects. And while that time and effort might appear more "valuable" if spent on other tasks, there's no guarantee that it in fact would be. For tasks that most contributors find unpleasant (dealing with certain types of vandalism, perhaps), automation is clearly the way to go. But repetition does not necessarily equal tedium in all circumstances or for all people. Nor do we need to apply some business-type evaluation of what constitutes "productive" effort, at least in the context of volunteer work. If a task simply makes someone feel productive, their own evaluation is what matters, and it can help them feel more engaged and part of the community. My general point is that opportunities for automation are best considered with our overall mission in mind, not just the speed or efficiency of a particular workflow. In certain situations, automation that creates more work rather than removing it (such as by identifying potential tasks and feeding them to editors) might be preferable. And some of our tools already use such an approach, which is a good thing. --Michael Snow

Marc A. Pelletier

6:52 p.m.

On 11/20/2013 01:13 PM, Michael Snow wrote:

...

My general point is that opportunities for automation are best considered with our overall mission in mind, not just the speed or efficiency of a particular workflow. In certain situations, automation that creates more work rather than removing it (such as by identifying potential tasks and feeding them to editors) might be preferable. And some of our tools already use such an approach, which is a good thing.

That's an interesting approach, but I'm not sure how constructive it is in the long run. I suppose it depends greatly on whether one considers our mission to be 'building an encyclopedia to share in the sum[...]' or 'having an encyclopedia to share in the sum[...]' (I'm not sure if I make the subtle distinction here clear). Perhaps another way of putting it is to ask whether the encyclopedia-building community is the means or the ends. To my eyes, having "more contributors" is not valuable unless it has "better encyclopedia" as a direct consequence. -- Marc

David Gerard

7:11 p.m.

New subject: Copyright infringement - The real elephant in the room

On 20 November 2013 18:52, Marc A. Pelletier <marc(a)uberbox.org> wrote:

...

Perhaps another way of putting it is to ask whether the encyclopedia-building community is the means or the ends. To my eyes, having "more contributors" is not valuable unless it has "better encyclopedia" as a direct consequence.

I think it's not a sufficient condition, but that it is a necessary one. Think LibreOffice and their Easy Hacks list, for example - simple things a C++ coder could achieve even if unfamiliar with the (huge, hideous) code base: https://wiki.documentfoundation.org/Development/Easy_Hacks I know the standard en:wp {{welcome}} message used to suggest things that needed attention ... though frankly, many of them don't get attention because they're stultifyingly boring (most things in [[Category:Cleanup]] are never leaving it). - d.

Michael Snow

7:52 p.m.

On 11/20/2013 10:52 AM, Marc A. Pelletier wrote:

...

I believe the mission is sufficiently large in scope that having more people involved is fundamentally desirable in general. Although to circle back to an earlier point in the discussion, that doesn't require that we accept involvement that is counterproductive. Maintaining our standards is a way of acknowledging that the number of people involved is not itself the end goal. --Michael Snow

The Cunctator

5:06 p.m.

New subject: Copyright infringement - The real elephant in the room

There's also been discussion of automatically deleting content that contributors copied from their own writing. On Nov 20, 2013 8:31 AM, "Marc A. Pelletier" <marc(a)uberbox.org> wrote:

...

On 11/20/2013 07:13 AM, The Cunctator wrote:

Yes, let's keep on pushing for policies that drive away editors!

Let's be clear here: contributions that are copyright violations are not desirable to begin with. If someone is driven away because they cannot cut and paste from random websites anymore, I'm not sure that this could reasonably be taken to be a bad thing. -- Marc

Tobias

13 Nov 13 Nov

9:41 p.m.

On 11/13/2013 08:40 AM, James Heilman wrote:

...

Our biggest issue is copyright infringement.

When it comes to copyright infringement, among all community sites on the Internet, Wikipedia is one of the best at handling it. Many websites don't even bother with copyright unless they get a DMCA takedown notice. We, on the other hand, have voluntary contributors checking pages and raising flags whenever there is even a suspicion of a copyright violation. This seems to be highly effective in many cases. A few days ago, I wrote an email to a photographer whose photos had been uploaded to Commons. He said I was the third to ask him whether he really had uploaded those images (which he had). Unquestionably, there are also many instances where the system fails and where lots of copyrighted material gets uploaded. Back in 2005, we had a case similar to the one you described in German Wikipedia, where various IPs copied content from old books. It is a big mess to clean up, but it can be done. And luckily the cases of massive copyvios are quite rare. I think the community has done a very good job in the past 12 years when it comes to copyright. It is important to see that we are a community site – nothing is ever going to be perfect, and certainly we are not free of any copyright violations. But we are dealing with them in a very responsible way and I would say that our current efforts are sufficient. Tobias

Martin Rulsch

11:22 p.m.

New subject: Copyright infringement - The real elephant in the room

...

Unquestionably, there are also many instances where the system fails and where lots of copyrighted material gets uploaded. Back in 2005, we had a case similar to the one you described in German Wikipedia, where various IPs copied content from old books. It is a big mess to clean up, but it can be done. And luckily the cases of massive copyvios are quite rare.

For further information see https://de.wikipedia.org/wiki/Wikipedia:Archiv/DDR-URV/Presseinfo (German). Cheers Martin

Marc A. Pelletier

16 Nov 16 Nov

3:43 p.m.

On 11/13/2013 04:41 PM, Tobias wrote:

...

I think the community has done a very good job in the past 12 years when it comes to copyright. It is important to see that we are a community site – nothing is ever going to be perfect, and certainly we are not free of any copyright violations. But we are dealing with them in a very responsible way and I would say that our current efforts are sufficient.

I think that's the best way of summing it up. "Sufficient" is a vague metric, and leaves room for improvement, but the nutshell is that the community /does/ take copyright violations seriously and deploys very good efforts to curtail them. Do some slip through? Yes, without doubt. Are they eliminated with prejudice the second they are noticed? Yes. The Wikimedia projects are no worse than any other collected works when it comes to copyright infringement and indeed tend to handle it with more vigilance than the other sites in the top 10 (proactively, rather than reactively). Could we do better? No doubt. Is improvement so desperately critical that we should drop everything else to concentrate on that? Not a chance. And I speak as the author and (for a long time) maintainer of one of the most visible and used copyright violation detection tools used on our project (CorenSearchBot, now handled by MadmanBot and - last I heard - used on around a dozen projects). -- Marc

wikimedia-l@lists.wikimedia.org

52 comments

30 participants

participants (30)

Andrew Gray
Andrew Lih
Anthony Cole
Benjamin Lees
Chris McKenna
David Gerard
Federico Leva (Nemo)
Florence Devouard
Fæ
George Herbert
Gerard Meijssen
James Heilman
Laura Hale
Lodewijk
Marc A. Pelletier
Marco Chiesa
Martijn Hoekstra
Martin Rulsch
Matthew Flaschen
Matthew Flaschen
Michael Snow
Nathan
Philippe Beaudette
Quim Gil
Richard Symonds
rupert THURNER
Samuel Klein
Steven Walling
The Cunctator
Tobias