Having dabbled in this initiative a couple years back when it first
started to gain some traction, I'll make some comments.
Yes, CorenSearchBot (CSB) did/does(?) operate in this space. It
basically took the title of a new article, searched for that term via
the Yahoo! Search API, and looked for nearly-exact text matches among
the first results (using an edit-distance metric).
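To make that concrete, here is a minimal sketch of that pipeline in
Python. The search_web() helper is hypothetical (standing in for the
now-retired Yahoo! Search API), and the 0.85 threshold is illustrative,
not CSB's actual cutoff:

    import difflib

    SIMILARITY_THRESHOLD = 0.85  # illustrative cutoff, not CSB's real value

    def looks_copied(article_title, article_text, top_k=10):
        """Flag a new article whose text nearly matches a top search hit."""
        # search_web() is a hypothetical helper returning (url, page_text) pairs
        for url, page_text in search_web(article_title)[:top_k]:
            # SequenceMatcher.ratio() gives an edit-distance-style
            # similarity score in [0, 1]
            score = difflib.SequenceMatcher(None, article_text, page_text).ratio()
            if score >= SIMILARITY_THRESHOLD:
                return url, score
        return None
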
Through the hard work of Jake Orlowitz and others we got free access to
the Turnitin API (academic plagiarism detection). Their text matching is
much more sophisticated, and they have access to material behind many
paywalls.
In terms of Jane's concern, we are (rather, "we imagine being")
primarily limited to finding violations originating at new article
creation or massive text insertions, because content already on WP has
been scraped and re-copied so many times.
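Operationally, restricting to those two trigger events is cheap to
detect from the live feed. A sketch using the standard MediaWiki
recentchanges API; the 5,000-byte growth cutoff for "massive" is an
arbitrary assumption on my part:

    import requests

    API = "https://en.wikipedia.org/w/api.php"
    MIN_GROWTH = 5000  # bytes; arbitrary cutoff for a "massive" insertion

    def candidate_edits(limit=50):
        """Yield page creations and large insertions from recent changes."""
        params = {
            "action": "query", "list": "recentchanges", "format": "json",
            "rctype": "new|edit", "rcprop": "title|ids|sizes",
            "rclimit": limit,
        }
        data = requests.get(API, params=params).json()
        for rc in data["query"]["recentchanges"]:
            growth = rc.get("newlen", 0) - rc.get("oldlen", 0)
            if rc["type"] == "new" or growth >= MIN_GROWTH:
                yield rc["title"], rc["revid"], growth

Anything that comes out of a filter like this would then be handed to
the matching step above.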
*I want to emphasize that this is a gift-wrapped academic research
project*. Jake, User:Madman, and I even began amassing ground truth to
evaluate our approach. This was nearly a chapter in my dissertation. I
would be very pleased for someone to come along, build a tool of
practice, and also get themselves a WikiSym/CSCW paper in the process. I
don't have the free cycles to do low-level coding, but I'd be happy to
advise, comment, etc. to whatever degree someone would desire. Thanks, -AW
--
Andrew G. West, PhD
Research Scientist
Verisign Labs - Reston, VA
Website:
http://www.andrew-g-west.com
On 07/21/2014 03:52 AM, Jane Darnell wrote:
> Isn't that what Corenbot does/did? I always found it very confusing
> whenever I ran into it, though, and the false positives are huge (so
> many sites copy Wikimedia content these days).
>
>
> On Mon, Jul 21, 2014 at 9:11 AM, Pine W <wiki.pine@gmail.com> wrote:
>
> It should be relatively easy to catch a significant percentage of those
> copyright violations with the assistance of automated search tools. The
> trick is to do it at a large scale in near-real-time, which might
> require some computationally and bandwidth-intensive work. James, can I
> suggest that you take this discussion to Wiki-Research-l? There are a
> number of ways that the copyright violation problem could be addressed,
> and I think this would be a good subject for discussion on that list,
> or at Wikimania. Depending on how the discussion on Research goes, it
> might be good to invite some dev or tech ops people to participate in
> the discussion as well.
>
> Pine
>
>
> On Sun, Jul 20, 2014 at 7:05 PM, Leigh Thelmadatter
> <osamadre@hotmail.com> wrote:
>
> > This is one of the best ideas I've read on here!
> >
> >
> > > Date: Sun, 20 Jul 2014 20:00:28 -0600
> > > From: jmh649@gmail.com
> > > To: wikimedia-l@lists.wikimedia.org; eloquence@gmail.com;
> > > fschulenburg@wikimedia.org; ladsgroup@gmail.com; jorlowitz@gmail.com;
> > > madman.enwiki@gmail.com; west.andrew.g@gmail.com
> > > Subject: [Wikimedia-l] Catching copy and pasting early
> > >
> > > Came across another few thousand edits of copy-and-paste violations
> > > again today. These have occurred over more than a year. It is
> > > wearing me out. Really, what is the point of collaborating on
> > > Wikipedia if it is simply a copyright violation? We need a solution,
> > > and one was proposed here a couple of years ago:
> > > https://en.wikipedia.org/wiki/Wikipedia:Turnitin
> > >
> > > We now need programmers to carry it out. The Wiki Education
> > > Foundation has expressed interest. We will need support from the
> > > foundation, as this software will likely need to mesh closely with
> > > edits as they come in. I am willing to offer $5,000 Canadian (almost
> > > the same as American) for a working solution that tags potential
> > > copyright issues in near real time with greater than 90% accuracy.
> > > It is to function on at least all medical and pharmacology articles,
> > > but I would not complain if it worked on all of Wikipedia. The WMF
> > > is free to apply.
> > >
> > > --
> > > James Heilman
> > > MD, CCFP-EM, Wikipedian
> > >
> > > The Wikipedia Open Textbook of Medicine
> > > www.opentextbookofmedicine.com
> _______________________________________________
> Wiki-research-l mailing list
> Wiki-research-l@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wiki-research-l