[Wikimedia-l] Copy and paste

Tom Morris tom at tommorris.org
Thu Oct 18 12:31:32 UTC 2012

On Thu, Oct 18, 2012 at 6:26 AM, James Heilman <jmh649 at gmail.com> wrote:
> We really need a plagiarism detection tool so that we can make sure our
> sources are not simply "copy and pastes" of older versions of Wikipedia.
> Today I was happily improving our article on pneumonia as I have a day off.
> I came across a recommendation that babies should be suctioned at birth to
> decrease their risk of pneumonia with a {{cn}} tag. So I went to Google
> books and up came a book that supported it perfectly. And then I noticed
> that this book supported the previous and next few sentences as well. It
> also supported a number of other sections we had in the article but was
> missing our references. The book was selling for $340 a copy. Our articles
> have improved a great deal since 2007 and yet schools are buying copy-edited
> versions of Wikipedia from 5 years ago. The bit about suctioning babies at
> birth was wrong and I have corrected it. I think we need to get this
> news out. Support Wikipedia and use the latest version online!

It's sort of unrelated, but there's a project called Common Crawl:


It is trying to produce an "open crawl of the web" (much as Google,
Bing etc. have for their search engines).

Now that the copyvio bot is down, I'm wondering if someone would be
interested in building something that used the Common Crawl database,
or whether that'd be practical.
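
To make the idea a bit more concrete, here is a rough sketch of the
comparison step such a tool might use: word-level shingling, then the
fraction of an article's shingles that also appear in a crawled page.
The function names, the 5-word shingle size, and the choice of a
containment score are all assumptions for illustration; nothing here
is how the copyvio bot or Common Crawl actually works, and fetching
pages from the crawl is left out entirely.

```python
import re


def shingles(text, n=5):
    """Return the set of n-word shingles (as tuples) in text, case-insensitive."""
    words = re.findall(r"\w+", text.lower())
    # A text shorter than n words yields an empty set.
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def containment(article, page, n=5):
    """Fraction of the article's shingles that also occur in the page.

    1.0 means every n-word run in the article appears in the page;
    0.0 means none do (or the article is too short to shingle).
    """
    article_shingles = shingles(article, n)
    if not article_shingles:
        return 0.0
    return len(article_shingles & shingles(page, n)) / len(article_shingles)
```

A high score only says the two texts overlap; it does not say who
copied whom, so something like crawl dates versus article revision
dates would still be needed to tell a copyvio from a mirror of an
older Wikipedia version.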

Tom Morris
