On Mon, Jul 21, 2014 at 9:52 AM, Andrew G. West <west.andrew.g@gmail.com> wrote:
Having dabbled in this initiative a couple years back when it first started to gain some traction, I'll make some comments.

Yes, CorenSearchBot (CSB) did/does(?) operate in this space. It basically took the title of a new article, searched for that term via the Yahoo! Search API, and looked for nearly-exact text matches among the first results (using an edit distance metric).
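
For anyone who wants to prototype something in that vein, the core loop is small. Here is a rough Python sketch under some assumptions: search_web() is just a placeholder for whatever search API you have access to (the Yahoo! API is no longer the obvious choice), and difflib's ratio stands in for whatever edit-distance metric you would actually pick.

    # Sketch of a CSB-style check; not CSB's actual code.
    from difflib import SequenceMatcher

    def similarity(a, b):
        """Return a 0..1 similarity score between two texts."""
        return SequenceMatcher(None, a, b).ratio()

    def flag_possible_copyvio(title, article_text, search_web, threshold=0.8):
        """Search for the article title and flag near-exact text matches."""
        suspects = []
        for url, page_text in search_web(title, max_results=5):
            score = similarity(article_text, page_text)
            if score >= threshold:
                suspects.append((url, score))
        return suspects

The 0.8 threshold is arbitrary; in practice you would tune it against labeled data.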

Through the hard work of Jake Orlowitz and others, we got free access to the TurnItIn API (academic plagiarism detection). Their tool is much more sophisticated in terms of text matching and has access to material behind many paywalls.

In terms of Jane's concern, we are (or rather, "we imagine being") primarily limited to finding violations that originate at new article creation or in massive text insertions, because content already on WP has been scraped and re-copied so many times.
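
To make that scoping concrete, the gating we had in mind is trivial to express; a toy sketch (the 500-character threshold below is an arbitrary number I'm making up for illustration):

    # Toy filter: only check brand-new pages or edits that add a large chunk of text.
    def should_check(old_text, new_text, min_added_chars=500):
        if old_text is None:        # new article creation
            return True
        return len(new_text) - len(old_text) >= min_added_chars

Anything that passes a filter like this would then go to the matching back-end (a CSB-style search or the TurnItIn API).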

*I want to emphasize that this is a gift-wrapped academic research project.* Jake, User:Madman, and I even began amassing ground truth to evaluate our approach; this was nearly a chapter in my dissertation. I would be very pleased for someone to come along, build a practical tool, and also get themselves a WikiSym/CSCW paper in the process. I don't have the free cycles to do low-level coding, but I'd be happy to advise, comment, etc. to whatever degree someone would desire. Thanks, -AW

--
Andrew G. West, PhD
Research Scientist
Verisign Labs - Reston, VA
Website: http://www.andrew-g-west.com



Some questions that aren't answered by the Wikipedia:Turnitin page:

#Has any testing been done on a set of edits to see what the results might look like? I'm a little unconvinced by the idea of comparing edits with tens of millions of term papers and other submissions. If testing hasn't begun, why not? What's lacking?

#The page says there will be no formal or contractual relationship between Turnitin and WMF, but I don't see how this can necessarily be true if it's assumed Turnitin will be able to use the "Wikipedia" name in marketing material. Thoughts?

#What's the value of running the process against all edits (many of which may be minor, or may not involve any substantial text insertion) vs. skimming all pages, or a subset of them, each day? (I'm assuming a few million more page loads per day won't affect the Wikimedia servers substantially.)

#What mechanism would be used to add the report link to talk pages? A bot account operated by Turnitin? Would access to the Turnitin database be restricted/proprietary, or could other bot developers query it for various purposes?
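
To make that last question concrete, here is the sort of bot edit it implies, sketched with pywikibot; the {{Turnitin report}} template and the report URL are invented purely for illustration and don't reflect anything that has actually been agreed:

    # Illustration only: one way a bot might post a report link to a talk page.
    import pywikibot

    def post_report_link(article_title, report_url):
        site = pywikibot.Site("en", "wikipedia")
        talk = pywikibot.Page(site, "Talk:" + article_title)
        talk.text += "\n\n{{Turnitin report|url=%s}}" % report_url  # hypothetical template
        talk.save(summary="Bot: adding Turnitin similarity report link")

Which account runs that, and under whose approval, is exactly what needs answering.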

It sounds like there's a desire to just skip to the end and agree to switch Turnitin on as a scan for all edits, but I think these questions and more will need to be answered before people will agree to anything like full-scale implementation.