Hey folks.
As James noted, Wiki Education Foundation is planning to do some work on this problem. I'll be the project manager for it, and I'll be grateful for all the help and advice I can get. I'm in the process now of finding a development company to work with.
Our current plan is to complete a "feasibility study" by February 2015. Basically, that means doing enough exploratory development to get a clear picture of just how big a project it will be. The first goal would be to scratch our own itch: to set up a system that checks all edits made by student editors in our courses, highlights apparent plagiarism on a course dashboard (on wikiedu.org), and alerts the instructor and/or the student via email. If we can do that, it should provide a good starting point for scaling up to all of Wikipedia.
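To make that a little more concrete, here's a very rough sketch (in Python) of the sort of loop I have in mind. Everything in it is hypothetical at this point: the student roster, the size cutoff, the check_similarity() call, and the threshold are placeholders, since none of the actual design work has happened yet.

    # Rough sketch only. The roster, the 500-byte cutoff, check_similarity(),
    # and the 0.8 threshold are all placeholders, not a real design.
    import requests

    API = "https://en.wikipedia.org/w/api.php"
    STUDENTS = ["ExampleStudent1", "ExampleStudent2"]  # would come from the course dashboard

    def recent_additions(user, limit=20):
        """Fetch a student's recent contributions via the MediaWiki API."""
        params = {
            "action": "query", "list": "usercontribs", "ucuser": user,
            "uclimit": limit, "ucprop": "ids|title|sizediff", "format": "json",
        }
        contribs = requests.get(API, params=params).json()["query"]["usercontribs"]
        # Ignore edits that didn't add a substantial amount of text.
        return [c for c in contribs if c.get("sizediff", 0) > 500]

    def check_similarity(revid):
        """Placeholder for a call to an external plagiarism-check service."""
        return 0.0  # a real version would send the added text to iThenticate or similar

    for student in STUDENTS:
        for edit in recent_additions(student):
            if check_similarity(edit["revid"]) > 0.8:
                print("flag for dashboard / email:", student, edit["title"], edit["revid"])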
I think Jane is right to highlight the workflow problem. That's also a workflow that would be very different for a Wikipedia-wide system versus what I describe above, where we're working with editors in a specific context (course assignments) and we can communicate with them offwiki. My first idea would be something that notifies the responsible editors directly so that they can fix the problems themselves, rather than one that requires "field workers" to sift through the positives to clean up after others. The point would be to catch problems early, so that users correct their own behavior before they've done the same thing over and over again.
Nathan, finding answers to some of those questions will be part of the feasibility study. One of Wiki Ed's key goals for this is to minimize false positives, so we'll want to spend some time experimenting with what kinds of edits we can reliably detect as true positives. It may be that only edits of a certain size are worth checking, or only blocks of text that don't rewrite existing content (a rough sketch of that kind of filter is below). Regarding term papers, it might be a little confusing to refer to "Turnitin": the working plan has been to use a different service from the same company, called iThenticate. It differs from Turnitin in that it's focused on checking content against published sources (on the web and in academic databases), and it doesn't include Turnitin's database of previously submitted student papers.
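To illustrate that second kind of filter (purely hypothetical numbers, since the experimentation hasn't happened yet), the idea would be to pull out only the genuinely new blocks of text in a revision and ignore small additions and rewrites of existing content:

    # Hypothetical pre-filter: keep only newly inserted blocks of text from a
    # revision, skipping rewrites and anything too short to be worth checking.
    import difflib

    MIN_BLOCK_CHARS = 300  # arbitrary; the feasibility study would have to tune this

    def blocks_worth_checking(old_text, new_text):
        """Return text blocks that are new in this revision, not rewrites."""
        matcher = difflib.SequenceMatcher(None, old_text, new_text)
        blocks = []
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag == "insert" and (j2 - j1) >= MIN_BLOCK_CHARS:
                blocks.append(new_text[j1:j2])
        return blocks

    # Only the blocks returned here would be sent to iThenticate for checking;
    # "replace" opcodes (rewrites of existing content) are deliberately ignored.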
Andrew: when we get closer to breaking ground, I'd love to talk it over with you.
Sage Ross
User:Sage (Wiki Ed) / User:Ragesoss
Product Manager, Digital Services
Wiki Education Foundation
On Mon, Jul 21, 2014 at 11:17 AM, Jane Darnell jane023@gmail.com wrote:
It's been a while, but as I recall, my problem with the Corenbot was the text it inserted on the page (a loud banner with a link to the original text on some website, which was often not at all related to the matter at hand). What confused me was the instructional text in the link; I wasn't sure whether I should leave it or delete it (ah, those were the days, back when I thought my submissions were thoughtfully read the moment I pressed publish!). The problem with implementing this sort of idea is that you need a bunch of field workers to sift through all of the positives, so you can be sure you are not needlessly confusing some newbie somewhere. The bot is one thing; the workflow is something else entirely.
On Mon, Jul 21, 2014 at 4:29 PM, Nathan nawrich@gmail.com wrote:
On Mon, Jul 21, 2014 at 9:52 AM, Andrew G. West west.andrew.g@gmail.com wrote:
Having dabbled in this initiative a couple years back when it first started to gain some traction, I'll make some comments.
Yes, CorenSearchBot (CSB) did/does(?) operate in this space. It basically took the title of a new article, searched for that term via the Yahoo! Search API, and looked for nearly-exact text matches among the first results (using an edit-distance metric).
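For anyone who hasn't looked at it, the core idea is simple enough to sketch in a few lines. This is my paraphrase, not CSB's actual code, and web_search() is a stand-in for whatever search API one has access to (the Yahoo! one it used is gone):

    # Paraphrase of the CSB approach, not its actual code.
    import difflib

    def web_search(query, max_results=10):
        """Stand-in: return a list of (url, page_text) pairs for the top hits."""
        return []

    def probable_source(article_title, article_text, threshold=0.8):
        """Return the URL of a near-exact text match among the first results, if any."""
        for url, page_text in web_search(article_title):
            # difflib's ratio() is one simple edit-distance-style similarity measure.
            if difflib.SequenceMatcher(None, article_text, page_text).ratio() >= threshold:
                return url
        return None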
Through the hard work of Jake Orlowitz and others, we got free access to the TurnItIn API (academic plagiarism detection). Their tool is much more sophisticated in terms of text matching and has access to material behind many paywalls.
In terms of Jane's concern, we are (rather, "we imagine being") primarily limited to finding violations originating at new article creation or massive text insertions, because content already on WP has been scraped and re-copied so many times.
*I want to emphasize that this is a gift-wrapped academic research project*. Jake, User:Madman, and I even began amassing ground truth to evaluate our approach; it was nearly a chapter in my dissertation. I would be very pleased for someone to come along, build a practical tool, and get themselves a WikiSym/CSCW paper in the process. I don't have the free cycles to do low-level coding, but I'd be happy to advise, comment, etc. to whatever degree someone would desire. Thanks, -AW
--
Andrew G. West, PhD
Research Scientist
Verisign Labs - Reston, VA
Website: http://www.andrew-g-west.com
Some questions that aren't answered by the Wikipedia:Turnitin page:
#Has any testing been done on a set of edits to see what the results might look like? I'm a little unconvinced by the idea of comparing edits with tens of millions of term papers or other submissions. If testing hasn't begun, why not? What's lacking?
#The page says there will be no formal or contractual relationship between Turnitin and WMF, but I don't see how this can necessarily be true if it's assumed Turnitin will be able to use the "Wikipedia" name in marketing material. Thoughts?
#What's the value of running the process against all edits (many of which may be minor, or not involve any substantial text insertions) vs. skimming all or a subset of all pages each day? (I'm assuming a few million more pageloads per day won't affect the Wikimedia servers substantially).
#What mechanism would be used to add the report link to the talkpages? A bot account operated by Turnitin? Would access to the Turnitin database be restricted / proprietary, or could other bot developers query it for various purposes?
It sounds like there's a desire to just skip to the end and agree to switch Turnitin on as a scan for all edits, but I think these questions and more will need to be answered before people will agree to anything like full-scale implementation.
Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l