We really need a plagiarism detection tool so that we can make sure our sources are not simply "copy and pastes" of older versions of Wikipedia. Today I was happily improving our article on pneumonia, as I have a day off. I came across a recommendation, tagged with {{cn}}, that babies should be suctioned at birth to decrease their risk of pneumonia. So I went to Google Books, and up came a book that supported it perfectly. And then I noticed that this book supported the previous and next few sentences as well. It also supported a number of other sections we had in the article, but was missing our references. The book was selling for $340 a copy. Our articles have improved a great deal since 2007, and yet schools are buying copy-edited versions of Wikipedia from 5 years ago. The bit about suctioning babies at birth was wrong, and I have corrected it. I think we need to get this news out: support Wikipedia and use the latest version online!
Further details / discussion are here: http://en.wikipedia.org/wiki/Wikipedia_talk:WikiProject_Medicine#Can_we_stil...
How hard would it be to set up a tool like the software that, as far as I know, MIT uses to automatically check for plagiarism among theses etc. submitted to their digital library? It would check the text of all Wikimedia projects against e.g. newspaper websites and Google Books, and then publish the results in some visually appealing way to show how much newspapers copy from Wikipedia and from each other. On it.wiki we regularly see complaints and unhappy discussions about newspaper articles which are just a copy and paste from Wikipedia and yet still carry a "COPY RESERVED" warning without citing any source... Newspapers are by definition arrogant, so nothing can be done to stop them, but an informative tool would be useful and might be as effective as WikiScanner was with regard to IP editing from organizations.
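To make that concrete, here is a minimal sketch of the kind of matching such a tool might do, assuming the candidate text (a newspaper page or a Google Books snippet) has already been fetched as plain text; the function names, shingle size and threshold are only illustrative, not any existing tool's interface:

# Shingle-based overlap between a Wikipedia article and an external text.
# Everything here (names, n, the example strings) is illustrative only.
import re

def shingles(text, n=8):
    """Return the set of n-word shingles (word n-grams) in a text."""
    words = re.findall(r"\w+", text.lower())
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def overlap_ratio(wiki_text, external_text, n=8):
    """Fraction of the external text's shingles that also occur in the article."""
    wiki = shingles(wiki_text, n)
    ext = shingles(external_text, n)
    return len(wiki & ext) / len(ext) if ext else 0.0

if __name__ == "__main__":
    article = ("Pneumonia is an inflammatory condition of the lung "
               "affecting primarily the small air sacs known as alveoli.")
    suspect = ("The authors state that pneumonia is an inflammatory condition "
               "of the lung affecting primarily the small air sacs known as alveoli.")
    print("overlap: %.2f" % overlap_ratio(article, suspect, n=6))

A high ratio over a long passage only tells you that someone copied someone; the article's revision history still has to settle which direction the copying went.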
Nemo
It's sort of unrelated, but there's a project called Common Crawl: it is trying to produce an "open crawl of the web", much as Google, Bing, etc. have for their search engines.
Now that the copyvio bot is down, I'm wondering if someone would be interested in building something that used the Common Crawl database, or whether that'd be practical.
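If someone does look at this: as a very rough sketch, and assuming the public URL index service at index.commoncrawl.org (the crawl identifier below is a placeholder, not a recommendation), one could pull the index records for a site and then fetch and compare those pages against our articles, something like:

# Query the Common Crawl URL index for pages from a given site, so their
# text can then be fetched and compared against Wikipedia articles.
# The crawl identifier is a placeholder; pick a real one from the index site.
import json
import requests

INDEX = "https://index.commoncrawl.org/CC-MAIN-YYYY-WW-index"  # placeholder

def crawled_pages(domain, limit=20):
    """Yield index records (URL, WARC filename, offset, length) for a domain."""
    resp = requests.get(INDEX, params={
        "url": domain + "/*",   # prefix match over the whole site
        "output": "json",
        "limit": str(limit),
    }, timeout=60)
    resp.raise_for_status()
    for line in resp.text.splitlines():
        yield json.loads(line)

if __name__ == "__main__":
    for record in crawled_pages("example.com", limit=5):
        print(record["url"], record.get("filename"))

The actual page text would then come out of the named WARC file, and the comparison itself could reuse whatever shingle matching the plagiarism tool ends up using.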
This situation was entirely predictable, even if its particular circumstances weren't. I ran into something of the sort as far back as 2003. I have long since lost track of references to the incident; it had to do with literary biographies of long-dead authors, and was thus much less critical than in a medical article. The broader question goes well beyond simple matters of plagiarism or copyright infringement. The passages will often be short enough that a fair-dealing claim is available, and the moral right to be credited for one's work has no meaningful legal enforcement to back it up. To those familiar with these things, that right isn't even controversial.
The disputed version in this case is a mere five years old. Over a longer span, one that could encompass the entire term of a copyright, we could easily see such a thing bounce back and forth many times over without ever being discovered. A bot could do some of the searching for infringing material; it might even look through archived and archaic versions of a document. I believe that at some point any such process reaches a limit, and a broader solution will need to be more imaginative than just more police work.
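The revision-trawling part of that bot is at least easy to prototype. Here is a rough sketch against the MediaWiki API (the parameter choices and example phrase are illustrative) that finds the oldest stored revision of an article containing a disputed phrase, which at least tells you which side had the text first:

# Walk an article's history oldest-first via the MediaWiki API and report
# the oldest stored revision that already contains a disputed phrase.
import requests

API = "https://en.wikipedia.org/w/api.php"

def first_appearance(title, phrase, batch=50):
    """Timestamp of the oldest revision of `title` containing `phrase`, or None."""
    params = {
        "action": "query", "format": "json", "prop": "revisions",
        "titles": title, "rvprop": "timestamp|content", "rvslots": "main",
        "rvlimit": batch, "rvdir": "newer",  # oldest revisions first
    }
    while True:
        data = requests.get(API, params=params, timeout=60).json()
        page = next(iter(data["query"]["pages"].values()))
        for rev in page.get("revisions", []):
            text = rev.get("slots", {}).get("main", {}).get("*", "")
            if phrase in text:
                return rev["timestamp"]
        if "continue" not in data:
            return None
        params.update(data["continue"])

if __name__ == "__main__":
    print(first_appearance("Pneumonia", "inflammatory condition of the lung"))

For an article with a long history this is slow and heavy on the API, so a real bot would binary-search the revision list or work from the dumps instead, but it shows that the "who had it first" question is mechanical even if the broader problem isn't.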
Ray