Having dabbled in this initiative a couple years back when it first
started to gain some traction, I'll make some comments.
Yes, CorenSearchBot (CSB) did/does(?) operate in this space. It
basically took the title of a new article, searched for that term via
the Yahoo! Search API, and looked for nearly-exact text matches among
the first results (using an edit-distance metric).
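To make that concrete, here is a minimal sketch of that pipeline in
Python. The search_web() helper is hypothetical (standing in for the
now-retired Yahoo! Search API), and the 0.85 threshold is illustrative,
not CSB's actual cutoff:

    import difflib

    SIMILARITY_THRESHOLD = 0.85  # illustrative cutoff, not CSB's real value

    def looks_copied(article_title, article_text, top_k=10):
        """Flag a new article whose text nearly matches a top search hit."""
        # search_web() is a hypothetical helper returning (url, page_text) pairs
        for url, page_text in search_web(article_title)[:top_k]:
            # SequenceMatcher.ratio() gives an edit-distance-style
            # similarity score in [0, 1]
            score = difflib.SequenceMatcher(None, article_text, page_text).ratio()
            if score >= SIMILARITY_THRESHOLD:
                return url, score
        return None
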
Through the hard work of Jake Orlowitz and others we got free access to
the Turnitin API (academic plagiarism detection). Their text matching is
much more sophisticated, and they have access to material behind many
paywalls.
In terms of Jane's concern, we are (rather, "we imagine being")
primarily limited to finding violations originating at new article
creation or massive text insertions, because content already on WP has
been scraped and re-copied so many times.
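Operationally, restricting to those two trigger events is cheap to
detect from the live feed. A sketch using the standard MediaWiki
recentchanges API; the 5,000-byte growth cutoff for "massive" is an
arbitrary assumption on my part:

    import requests

    API = "https://en.wikipedia.org/w/api.php"
    MIN_GROWTH = 5000  # bytes; arbitrary cutoff for a "massive" insertion

    def candidate_edits(limit=50):
        """Yield page creations and large insertions from recent changes."""
        params = {
            "action": "query", "list": "recentchanges", "format": "json",
            "rctype": "new|edit", "rcprop": "title|ids|sizes",
            "rclimit": limit,
        }
        data = requests.get(API, params=params).json()
        for rc in data["query"]["recentchanges"]:
            growth = rc.get("newlen", 0) - rc.get("oldlen", 0)
            if rc["type"] == "new" or growth >= MIN_GROWTH:
                yield rc["title"], rc["revid"], growth

Anything that comes out of a filter like this would then be handed to
the matching step above.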
*I want to emphasize that this is a gift-wrapped academic research
project*. Jake, User:Madman, and I even began amassing ground truth to
evaluate our approach. This was nearly a chapter in my dissertation. I
would be very pleased for someone to come along, build a tool of
practice, and also get themselves a WikiSym/CSCW paper in the process. I
don't have the free cycles to do low-level coding, but I'd be happy to
advise, comment, etc. to whatever degree someone would desire. Thanks, -AW
--
Andrew G. West, PhD
Research Scientist
Verisign Labs - Reston, VA
Website:
http://www.andrew-g-west.com
On 07/21/2014 03:52 AM, Jane Darnell wrote:
> Isn't that what Corenbot does/did? I always found it very confusing
> whenever I ran into it, though, and the false positives are huge (so
> many sites copy Wikimedia content these days).
>
>
> On Mon, Jul 21, 2014 at 9:11 AM, Pine W <wiki.pine@gmail.com> wrote:
>
> It should be relatively easy to catch a significant percentage of those
> copyright violations with the assistance of automated search tools. The
> trick is to do it at a large scale in near-real-time, which might
> require some computationally and bandwidth-intensive work. James, can I
> suggest that you take this discussion to Wiki-Research-l? There are a
> number of ways that the copyright violation problem could be addressed,
> and I think this would be a good subject for discussion on that list,
> or at Wikimania. Depending on how the discussion on Research goes, it
> might be good to invite some dev or tech ops people to participate in
> the discussion as well.
>
> Pine
>
>
> On Sun, Jul 20, 2014 at 7:05 PM, Leigh Thelmadatter
> <osamadre@hotmail.com> wrote:
>
> > This is one of the best ideas I've read on here!
> >
> >
> > > Date: Sun, 20 Jul 2014 20:00:28 -0600
> > > From: jmh649@gmail.com
> > > To: wikimedia-l@lists.wikimedia.org; eloquence@gmail.com;
> > > fschulenburg@wikimedia.org; ladsgroup@gmail.com; jorlowitz@gmail.com;
> > > madman.enwiki@gmail.com; west.andrew.g@gmail.com
> > > Subject: [Wikimedia-l] Catching copy and pasting early
> > >
> > > Came across another few thousand edits of copy-and-paste violations
> > > again today. These have occurred over more than a year. It is
> > > wearing me out. Really, what is the point of collaborating on
> > > Wikipedia if it is simply a copyright violation? We need a solution,
> > > and one was proposed here a couple of years ago:
> > > https://en.wikipedia.org/wiki/Wikipedia:Turnitin
> > >
> > > We now need programmers to carry it out. The Wiki Education
> > > Foundation has expressed interest. We will need support from the
> > > foundation, as this software will likely need to mesh closely with
> > > edits as they come in. I am willing to offer $5,000 Canadian (almost
> > > the same as American) for a working solution that tags potential
> > > copyright issues in near real time with greater than 90% accuracy.
> > > It is to function on at least all medical and pharmacology articles,
> > > but I would not complain if it worked on all of Wikipedia. The WMF
> > > is free to apply.
> > >
> > > --
> > > James Heilman
> > > MD, CCFP-EM, Wikipedian
> > >
> > > The Wikipedia Open Textbook of Medicine
> > > www.opentextbookofmedicine.com
> _______________________________________________
> Wiki-research-l mailing list
> Wiki-research-l@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wiki-research-l