It should be relatively easy to catch a significant percentage of those copyright violations with the assistance of automated search tools. The trick is to do it at large scale in near real time, which might require some computationally and bandwidth-intensive work. James, can I suggest that you take this discussion to Wiki-Research-l? There are a number of ways that the copyright violation problem could be addressed, and I think this would be a good subject for discussion on that list, or at Wikimania. Depending on how the discussion on Research goes, it might be good to invite some dev or tech-ops people to participate in the discussion as well.
Pine
On Sun, Jul 20, 2014 at 7:05 PM, Leigh Thelmadatter osamadre@hotmail.com wrote:
This is one of the best ideas I've read on here!
Date: Sun, 20 Jul 2014 20:00:28 -0600
From: jmh649@gmail.com
To: wikimedia-l@lists.wikimedia.org; eloquence@gmail.com; fschulenburg@wikimedia.org; ladsgroup@gmail.com; jorlowitz@gmail.com; madman.enwiki@gmail.com; west.andrew.g@gmail.com
Subject: [Wikimedia-l] Catching copy and pasting early
Came across another few thousand copy-and-paste violations again today. These have occurred over more than a year. It is wearing me out. Really, what is the point of collaborating on Wikipedia if it is simply a copyright violation? We need a solution, and one was proposed here a couple of years ago: https://en.wikipedia.org/wiki/Wikipedia:Turnitin
We now need programmers to carry it out. The Wiki Education Foundation has expressed interest. We will need support from the Foundation, as this software will likely need to mesh closely with edits as they come in. I am willing to offer $5,000 Canadian (almost the same as American) for a working solution that tags potential copyright issues in near real time with greater than 90% accuracy. It is to function on at least all medical and pharmacology articles, but I would not complain if it worked on all of Wikipedia. The WMF is free to apply.
-- James Heilman MD, CCFP-EM, Wikipedian
The Wikipedia Open Textbook of Medicine www.opentextbookofmedicine.com
Isn't that what Corenbot does/did? I always found it very confusing whenever I ran into it, and the rate of false positives is huge (so many sites copy Wikimedia content these days).
On Mon, Jul 21, 2014 at 9:11 AM, Pine W wiki.pine@gmail.com wrote:
It should be relatively easy to catch a significant percentage of those copyright violations with the assistance of automated search tools. The trick is to do it at a large scale in near-realtime, which might require some computationally intensive and bandwidth intensive work. James, can I suggest that you take this discussion to Wiki-Research-l? There are a number of ways that the copyright violation problem could be addressed and I think this would be a good subject for discussion on that list, or at Wikimania. Depending on how the discussion on Research goes, it might be good to invite some dev or tech ops people to participate in the discussion as well.
Pine
On Sun, Jul 20, 2014 at 7:05 PM, Leigh Thelmadatter osamadre@hotmail.com wrote:
This is one of the best ideas Ive read on here!
Date: Sun, 20 Jul 2014 20:00:28 -0600 From: jmh649@gmail.com To: wikimedia-l@lists.wikimedia.org; eloquence@gmail.com;
fschulenburg@wikimedia.org; ladsgroup@gmail.com; jorlowitz@gmail.com; madman.enwiki@gmail.com; west.andrew.g@gmail.com
Subject: [Wikimedia-l] Catching copy and pasting early
Come across another few thousand edits of copy and paste violations
again
today. These have occurred over more than a year. It is wearing me out. Really what is the point on collaborating on Wikipedia if it is simply
a
copyright violation. We need a solution and one has been proposed here
a
couple of years ago https://en.wikipedia.org/wiki/Wikipedia:Turnitin
We now need programmers to carry it out. The Wiki Education Foundation
has
expressed interest. We will need support from the foundation as this software will likely need to mesh closely with edits as they come in. I
am
willing to offer $5,000 dollars Canadian (almost the same as American)
for
a working solution that tags potential copyright issues in near real
time
with a greater than 90% accuracy. It is to function on at least all
medical
and pharmacology articles but I would not complain if it worked on all
of
Wikipedia. The WMF is free to apply.
-- James Heilman MD, CCFP-EM, Wikipedian
The Wikipedia Open Textbook of Medicine www.opentextbookofmedicine.com _______________________________________________ Wikimedia-l mailing list, guidelines at:
https://meta.wikimedia.org/wiki/Mailing_lists/Guidelines
Wikimedia-l@lists.wikimedia.org Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l,
mailto:wikimedia-l-request@lists.wikimedia.org?subject=unsubscribe
Wikimedia-l mailing list, guidelines at: https://meta.wikimedia.org/wiki/Mailing_lists/Guidelines Wikimedia-l@lists.wikimedia.org Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l, mailto:wikimedia-l-request@lists.wikimedia.org?subject=unsubscribe
Wikimedia-l mailing list, guidelines at: https://meta.wikimedia.org/wiki/Mailing_lists/Guidelines Wikimedia-l@lists.wikimedia.org Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l, mailto:wikimedia-l-request@lists.wikimedia.org?subject=unsubscribe
Having dabbled in this initiative a couple years back when it first started to gain some traction, I'll make some comments.
Yes, CorenSearchBot (CSB) did/does(?) operate in this space. It basically took the title of a new article, searched for that term via the Yahoo! Search API, and looked for nearly-exact text matches among the first results (using an edit-distance metric).
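(To make the mechanics concrete, here is a minimal Python sketch of that pipeline. It is not CSB's actual code: the search step is stubbed out, and difflib's similarity ratio stands in for whatever edit-distance metric the bot used. The threshold is invented.)

    import difflib

    def search_web(title):
        """Placeholder for the search step: return (url, page_text)
        pairs for the top results. A real implementation would call
        a web-search API and fetch each result's text."""
        return []

    def similarity(a, b):
        # 1.0 = identical text, 0.0 = nothing in common; a stand-in
        # for CSB's edit-distance comparison.
        return difflib.SequenceMatcher(None, a, b).ratio()

    def check_new_article(title, wikitext, threshold=0.8):
        """Flag search results whose text nearly matches a new article."""
        suspects = [(url, similarity(wikitext, text))
                    for url, text in search_web(title)]
        return sorted([s for s in suspects if s[1] >= threshold],
                      key=lambda s: -s[1])

Tuning that threshold against labeled data is exactly the ground-truth question that comes up below.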
Through the hard work of Jake Orlowitz and others we got free access to the TurnItIn API (academic plagiarism detection). Their tool is much more sophisticated in terms of text matching and has access to material behind many pay-walls.
In terms of Jane's concern, we are (rather, "we imagine being") primarily limited to finding violations originating at new article creation or massive text insertions, because content already on WP has been scraped and re-copied so many times.
*I want to emphasize that this is a gift-wrapped academic research project.* Jake, User:Madman, and I even began amassing ground truth to evaluate our approach. This was nearly a chapter in my dissertation. I would be very pleased for someone to come along, build a practical tool, and also get themselves a WikiSym/CSCW paper in the process. I don't have the free cycles to do low-level coding, but I'd be happy to advise, comment, etc. to whatever degree someone would desire. Thanks, -AW
-- Andrew G. West, PhD
Research Scientist, Verisign Labs - Reston, VA
Website: http://www.andrew-g-west.com
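(For anyone picking up the ground-truth work Andrew describes, the evaluation side is small; a minimal harness follows, with all names hypothetical. One caveat on the 90%-accuracy bounty upthread: on an edit stream where violations are rare, raw accuracy is easy to inflate by flagging nothing, so precision and recall are the more honest numbers, and the sketch reports all three.)

    def evaluate(classifier, labeled_edits):
        """classifier: callable returning True if an edit is flagged.
        labeled_edits: iterable of (edit_text, is_violation) pairs
        from a hand-labeled ground-truth set."""
        tp = fp = fn = tn = 0
        for text, is_violation in labeled_edits:
            flagged = bool(classifier(text))
            if flagged and is_violation:
                tp += 1
            elif flagged:
                fp += 1
            elif is_violation:
                fn += 1
            else:
                tn += 1
        return {
            "precision": tp / (tp + fp) if tp + fp else 0.0,
            "recall":    tp / (tp + fn) if tp + fn else 0.0,
            "accuracy":  (tp + tn) / max(tp + fp + fn + tn, 1),
        }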
Some questions that aren't answered by the Wikipedia:Turnitin page:
#Has any testing been done on a set of edits to see what the results might look like? I'm a little unconvinced by the idea of comparing edits with tens of millions of term papers or other submissions. If testing hasn't begun, why not? What's lacking?
#The page says there will be no formal or contractual relationship between Turnitin and WMF, but I don't see how this can necessarily be true if it's assumed Turnitin will be able to use the "Wikipedia" name in marketing material. Thoughts?
#What's the value of running the process against all edits (many of which may be minor, or not involve any substantial text insertions) vs. skimming all or a subset of all pages each day? (I'm assuming a few million more pageloads per day won't affect the Wikimedia servers substantially; see the filtering sketch after these questions.)
#What mechanism would be used to add the report link to the talkpages? A bot account operated by Turnitin? Would access to the Turnitin database be restricted / proprietary, or could other bot developers query it for various purposes?
It sounds like there's a desire to just skip to the end and agree to switch Turnitin on as a scan for all edits, but I think these questions and more will need to be answered before people will agree to anything like full scale implementation.
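(On question 3: the standard MediaWiki recentchanges API already exposes enough metadata to throw away minor edits and small insertions before any expensive matching is done, which is most of the answer to "all edits vs. daily skims". A sketch; the byte threshold is an invented assumption.)

    import requests

    API = "https://en.wikipedia.org/w/api.php"
    MIN_ADDED_BYTES = 500  # invented threshold: skip small insertions

    def substantial_recent_edits(limit=100):
        """Yield (title, revid, bytes_added) for recent non-minor,
        non-bot edits that grew a page by at least MIN_ADDED_BYTES."""
        params = {
            "action": "query",
            "list": "recentchanges",
            "rctype": "edit|new",
            "rcshow": "!minor|!bot",      # metadata-only filtering
            "rcprop": "title|ids|sizes",  # sizes -> oldlen/newlen
            "rclimit": limit,
            "format": "json",
        }
        data = requests.get(API, params=params,
                            headers={"User-Agent": "copyvio-sketch/0.1"}).json()
        for rc in data["query"]["recentchanges"]:
            added = rc["newlen"] - rc["oldlen"]
            if added >= MIN_ADDED_BYTES:
                yield rc["title"], rc["revid"], added

Only edits that survive a cheap filter like this would ever be sent to a matching backend, which is what keeps the "all edits" path computationally plausible.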
It's been a while, but as I recall, my problem with the Corenbot was the text that it inserted on the page (some loud banner with a link to the original text on some website, which was often not at all related to the matter at hand). My confusion was over the instructional text in the link, and I wasn't sure if I should leave it or delete it (ah, those were the days, back when I thought my submissions were thoughtfully read the moment I pressed publish!). The problem with implementing this sort of idea is that you need a bunch of field workers to sift through all of the positives, so you can be sure you are not needlessly confusing some newbie somewhere. The bot is one thing; the workflow is something else entirely.
Hey folks.
As James noted, the Wiki Education Foundation is planning to do some work on this problem. I'll be the project manager for it, and I'll be grateful for all the help and advice I can get. I'm in the process now of finding a development company to work with.
Our current plan is to complete a "feasibility study" by February 2015. Basically, that means doing enough exploratory development to get a clear picture of just how big a project it will be. The first goal would be to scratch our own itch: to set up a system that checks all edits made by student editors in our courses, highlights apparent plagiarism on a course dashboard (on wikiedu.org), and alerts the instructor and/or the student via email. However, if we can do that, it should provide a good starting point for scaling up to all of Wikipedia.
I think Jane is right to highlight the workflow problem. That's also a workflow that would be very different for a Wikipedia-wide system versus what I describe above, where we're working with editors in a specific context (course assignments) and we can communicate with them offwiki. My first idea would be something that notifies the responsible editors directly so that they can fix the problems themselves, rather than one that requires "field workers" to sift through the positives to clean up after others. The point would be to catch problems early, so that users correct their own behavior before they've done the same thing over and over again.
Nathan, finding answers to some of those questions will be part of the feasibility study. One of the key goals Wiki Ed has for this is to minimize false positives, so we'll want to spend some time experimenting with what kinds of edits we can reliably detect as true positives. It may be that only edits of a certain size are worth checking, or only blocks of text that don't rewrite existing content. Regarding term papers, it might be a little confusing to refer to "Turnitin", as the working plan has been to use a different service from the same company, called iThenticate. It differs from Turnitin in that it's more focused on checking content against published sources (on the web and in academic databases), and it doesn't include Turnitin's database of previously submitted papers.
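(The "certain size" and "doesn't rewrite existing content" heuristics might look something like the sketch below: extract what an edit inserted, and drop it if it is short or if it closely mirrors what the same edit removed. Both thresholds and the function name are invented for illustration.)

    import difflib

    MIN_CHARS = 300      # invented: ignore short insertions
    REWRITE_RATIO = 0.6  # invented: above this, treat as a rewrite

    def worth_checking(old_text, new_text):
        """Return the inserted text if it merits a plagiarism check,
        else None."""
        sm = difflib.SequenceMatcher(None, old_text, new_text)
        inserted, removed = [], []
        for op, i1, i2, j1, j2 in sm.get_opcodes():
            if op in ("insert", "replace"):
                inserted.append(new_text[j1:j2])
            if op in ("delete", "replace"):
                removed.append(old_text[i1:i2])
        added = " ".join(inserted)
        if len(added) < MIN_CHARS:
            return None
        # Added text that closely mirrors the removed text is a
        # rewrite of existing content, not imported outside text.
        rewrite = difflib.SequenceMatcher(None, " ".join(removed), added)
        if rewrite.ratio() > REWRITE_RATIO:
            return None
        return added  # candidate block to submit to iThenticate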
Andrew: when we get closer to breaking ground, I'd love to talk it over with you.
Sage Ross
User:Sage (Wiki Ed) / User:Ragesoss
Product Manager, Digital Services
Wiki Education Foundation
In light of the editor retention problem, I suggest we have to be very careful with any kind of "plagiarism detector" software, because we have real subject-matter experts among our editors. I'm aware of members of local history societies who have had issues with copyright violation because they have content on their own websites which they then contribute to Wikipedia. It's not a copyright violation because it's their own work, but it was deleted, they were accused of copyright violation, and they were naturally very unhappy about both. Being new users, they did not know any way to get this redressed; they asked me for help, and I got nowhere with the editor who deleted the material, who would not accept their assertion that they were the original authors (how on earth could they prove it?). As a result, none of them are now active editors.
Having had a whole bunch of my own images nearly deleted from Commons because they appear on my own website (despite my user name being my real name, which is all over my website), I know how it feels to have accusations of copyright violation all over your contributions - it's really offensive. Strangely, we have no way to whitelist particular websites in relation to particular users (in theory, you'd want to be able to whitelist books and offline resources too, but in practice "copies" from these are far less likely to be noticed), so the same problem can arise again and again for an individual contributor.
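(The per-user whitelist described above would be a small addition to any detector: before tagging anything, drop matches whose source the editor has been verified to own. A minimal sketch; the whitelist store and every name in it are invented.)

    from urllib.parse import urlparse

    # username -> domains that editor has been verified to control
    # (entirely hypothetical data)
    WHITELIST = {
        "LocalHistorian42": {"example-history-society.org"},
    }

    def is_whitelisted(username, matched_url):
        domain = urlparse(matched_url).netloc.lower()
        if domain.startswith("www."):
            domain = domain[4:]  # treat www.example.org as example.org
        return domain in WHITELIST.get(username, set())

    def filter_matches(username, matches):
        """Drop matches whose source site the editor owns, so their
        own reused text never triggers an accusation."""
        return [m for m in matches
                if not is_whitelisted(username, m["url"])]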
So I would be very hesitant about putting any visible tag on an article suggesting it was a copyright violation (as it seems to me it is both offensive and potentially libellous to the editor who has in good faith contributed their own work). I think any concern about copyright has to be raised first with the editor involved as a question, NOT an accusation. And I note that it is often very difficult to communicate with new/occasional editors, as they often have no email address associated with their account and they don't see talk page message banners unless they are logged in with "remember me". It's ironic that at the time a contributor is most likely to want or need help, we are in the worst position to know they want it, or to offer it if we see they need it.
So, I'm with Jane on this one. It's easy enough to detect a lot of potential copyright violations automatically. What's hard and very much a manual task is confirming it really is a copyright violation and, where required, educating the contributor. I think there's a real danger to automating the first part without a good solution to the second part. We have far too many editors who use tools as weapons already, so I am reluctant to give them more weapons.
Kerry