Hey folks.
As James noted, Wiki Education Foundation is planning to do some work on this problem. I'll be the project manager for it, and I'll be grateful for all the help and advice I can get. I'm in the process now of finding a development company to work with.
Our current plan is to complete a "feasibility study" by February 2015. Basically, that means doing enough exploratory development to get a clear picture of just how big a project it will be. The first goal would be to scratch our own itch: to set up a system that checks all edits made by student editors in our courses, highlights apparent plagiarism on a course dashboard (on wikiedu.org), and alerts the instructor and/or the student via email. If we can do that, it should provide a good starting point for scaling up to all of Wikipedia.
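To make that a little more concrete, here's a very rough sketch (in Python) of the sort of loop I have in mind. Everything in it is hypothetical at this point: the student roster, the size cutoff, the check_similarity() call, and the threshold are placeholders, since none of the actual design work has happened yet.

    # Rough sketch only. The roster, the 500-byte cutoff, check_similarity(),
    # and the 0.8 threshold are all placeholders, not a real design.
    import requests

    API = "https://en.wikipedia.org/w/api.php"
    STUDENTS = ["ExampleStudent1", "ExampleStudent2"]  # would come from the course dashboard

    def recent_additions(user, limit=20):
        """Fetch a student's recent contributions via the MediaWiki API."""
        params = {
            "action": "query", "list": "usercontribs", "ucuser": user,
            "uclimit": limit, "ucprop": "ids|title|sizediff", "format": "json",
        }
        contribs = requests.get(API, params=params).json()["query"]["usercontribs"]
        # Ignore edits that didn't add a substantial amount of text.
        return [c for c in contribs if c.get("sizediff", 0) > 500]

    def check_similarity(revid):
        """Placeholder for a call to an external plagiarism-check service."""
        return 0.0  # a real version would send the added text to iThenticate or similar

    for student in STUDENTS:
        for edit in recent_additions(student):
            if check_similarity(edit["revid"]) > 0.8:
                print("flag for dashboard / email:", student, edit["title"], edit["revid"])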
I think Jane is right to highlight the workflow problem. That's also a workflow that would be very different for a Wikipedia-wide system versus what I describe above, where we're working with editors in a specific context (course assignments) and we can communicate with them offwiki. My first idea would be something that notifies the responsible editors directly so that they can fix the problems themselves, rather than one that requires "field workers" to sift through the positives to clean up after others. The point would be to catch problems early, so that users correct their own behavior before they've done the same thing over and over again.
Nathan, finding answers to some of those questions will be part of the feasibility study. One of Wiki Ed's key goals for this is to minimize false positives, so we'll want to spend some time experimenting with what kinds of edits we can reliably detect as true positives. It may be that only edits of a certain size are worth checking, or only blocks of text that don't rewrite existing content (a rough sketch of that kind of filter is below). Regarding term papers, it might be a little confusing to refer to "Turnitin": the working plan has been to use a different service from the same company, called iThenticate. It differs from Turnitin in that it's focused on checking content against published sources (on the web and in academic databases), and it doesn't include Turnitin's database of previously submitted student papers.
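To illustrate that second kind of filter (purely hypothetical numbers, since the experimentation hasn't happened yet), the idea would be to pull out only the genuinely new blocks of text in a revision and ignore small additions and rewrites of existing content:

    # Hypothetical pre-filter: keep only newly inserted blocks of text from a
    # revision, skipping rewrites and anything too short to be worth checking.
    import difflib

    MIN_BLOCK_CHARS = 300  # arbitrary; the feasibility study would have to tune this

    def blocks_worth_checking(old_text, new_text):
        """Return text blocks that are new in this revision, not rewrites."""
        matcher = difflib.SequenceMatcher(None, old_text, new_text)
        blocks = []
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag == "insert" and (j2 - j1) >= MIN_BLOCK_CHARS:
                blocks.append(new_text[j1:j2])
        return blocks

    # Only the blocks returned here would be sent to iThenticate for checking;
    # "replace" opcodes (rewrites of existing content) are deliberately ignored.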
Andrew: when we get closer to breaking ground, I'd love to talk it over with you.
Sage Ross
User:Sage (Wiki Ed) / User:Ragesoss
Product Manager, Digital Services
Wiki Education Foundation
On Mon, Jul 21, 2014 at 11:17 AM, Jane Darnell jane023@gmail.com wrote:
It's been a while, but as I recall, my problem with the Corenbot was the text it inserted on the page (a loud banner with a link to the original text on some website, which was often not at all related to the matter at hand). What confused me was the instructional text in the link; I wasn't sure whether I should leave it or delete it (ah, those were the days, back when I thought my submissions were thoughtfully read the moment I pressed publish!). The problem with implementing this sort of idea is that you need a bunch of field workers to sift through all of the positives, so you can be sure you are not needlessly confusing some newbie somewhere. The bot is one thing; the workflow is something else entirely.
On Mon, Jul 21, 2014 at 4:29 PM, Nathan nawrich@gmail.com wrote:
On Mon, Jul 21, 2014 at 9:52 AM, Andrew G. West west.andrew.g@gmail.com wrote:
Having dabbled in this initiative a couple years back when it first started to gain some traction, I'll make some comments.
Yes, CorenSearchBot (CSB) did/does(?) operate in this space. It basically took the title of a new article, searched for that term via the Yahoo! Search API, and looked for nearly-exact text matches among the first results (using an edit-distance metric).
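For anyone who hasn't looked at it, the core idea is simple enough to sketch in a few lines. This is my paraphrase, not CSB's actual code, and web_search() is a stand-in for whatever search API one has access to (the Yahoo! one it used is gone):

    # Paraphrase of the CSB approach, not its actual code.
    import difflib

    def web_search(query, max_results=10):
        """Stand-in: return a list of (url, page_text) pairs for the top hits."""
        return []

    def probable_source(article_title, article_text, threshold=0.8):
        """Return the URL of a near-exact text match among the first results, if any."""
        for url, page_text in web_search(article_title):
            # difflib's ratio() is one simple edit-distance-style similarity measure.
            if difflib.SequenceMatcher(None, article_text, page_text).ratio() >= threshold:
                return url
        return None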
Through the hard work of Jake Orlowitz and others, we got free access to the TurnItIn API (academic plagiarism detection). Their tool is much more sophisticated in terms of text matching and has access to material behind many paywalls.
In terms of Jane's concern, we are (rather, "we imagine being") primarily limited to finding violations originating at new article creation or massive text insertions, because content already on WP has been scraped and re-copied so many times.
*I want to emphasize that this is a gift-wrapped academic research project*. Jake, User:Madman, and I even began amassing ground truth to evaluate our approach; it was nearly a chapter in my dissertation. I would be very pleased for someone to come along, build a practical tool, and get themselves a WikiSym/CSCW paper in the process. I don't have the free cycles to do low-level coding, but I'd be happy to advise, comment, etc. to whatever degree someone would desire. Thanks, -AW
--
Andrew G. West, PhD
Research Scientist
Verisign Labs - Reston, VA
Website: http://www.andrew-g-west.com
Some questions that aren't answered by the Wikipedia:Turnitin page:
#Has any testing been done on a set of edits to see what the results might look like? I'm a little unconvinced by the idea of comparing edits with tens of millions of term papers or other submissions. If testing hasn't begun, why not? What's lacking?
#The page says there will be no formal or contractual relationship between Turnitin and WMF, but I don't see how this can necessarily be true if it's assumed Turnitin will be able to use the "Wikipedia" name in marketing material. Thoughts?
#What's the value of running the process against all edits (many of which may be minor, or not involve any substantial text insertions) vs. skimming all or a subset of all pages each day? (I'm assuming a few million more pageloads per day won't affect the Wikimedia servers substantially).
#What mechanism would be used to add the report link to the talkpages? A bot account operated by Turnitin? Would access to the Turnitin database be restricted / proprietary, or could other bot developers query it for various purposes?
It sounds like there's a desire to just skip to the end and agree to switch Turnitin on as a scan for all edits, but I think these questions and more will need to be answered before people will agree to anything like full-scale implementation.
Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l