I just ran into a fairly significant copyright infringer who copied text wholesale into Joseph Wilson Swan and Internet Chess Club, and did a somewhat more complex job, drawing on several websites, with the Light bulb article (see http://www.wikipedia.com/wiki/Talk:Light_bulb). The contributor's IP was blocked as a result (pending his/her asking the list for the block to be removed).
The Internet Chess Club article is now fine because several users have since edited the text extensively (removing its advertising-like qualities; they were not aware of the copyright violation). The Light bulb article is in a similar state, but some entire sentences are still word for word the same as the source (not to mention the obvious parentage of many other sentences).
My question is this: would it be possible to program a bot to search Google for longish strings of text that are inserted into Wikipedia and then log the results when strings are matched? This bot could crawl through new diffs in Recent Changes and log possible violations, which a human would then have to review to see whether there is in fact any violation.
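A rough sketch of what such a bot's core loop might look like, assuming the old and new revision text of a page are already in hand. The names here (MIN_WORDS, inserted_text, candidate_queries, check_diff) and the `search` callable are hypothetical; no real Google or Wikipedia interface is used, so whatever does the actual searching has to be plugged in by the caller.

```python
import difflib

MIN_WORDS = 8  # assumed threshold for what counts as a "longish" string


def inserted_text(old, new):
    """Return the runs of words that appear in `new` but not in `old`."""
    old_words, new_words = old.split(), new.split()
    matcher = difflib.SequenceMatcher(None, old_words, new_words, autojunk=False)
    runs = []
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        if tag in ("insert", "replace"):
            runs.append(new_words[j1:j2])
    return runs


def candidate_queries(old, new, min_words=MIN_WORDS):
    """Turn each sufficiently long inserted run into a quoted phrase query."""
    queries = []
    for run in inserted_text(old, new):
        if len(run) >= min_words:
            # Only the first `min_words` words are used, to keep the query short.
            queries.append('"' + " ".join(run[:min_words]) + '"')
    return queries


def check_diff(title, old, new, search):
    """Log (title, query, hits) for every query that `search` finds elsewhere.

    `search` is a hypothetical callable supplied by the caller: it takes a
    query string and returns a list of matching URLs (empty if none).
    """
    log = []
    for query in candidate_queries(old, new):
        hits = search(query)
        if hits:
            log.append((title, query, hits))
    return log
```

The log entries are only candidates: a human would still have to open each hit and decide whether it is a real violation, public domain text, or a site that copied from Wikipedia.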
This will of course pick up a lot of public domain text, but if we also encouraged users to 'cite their sources', these legit uses of text could be quickly skipped over in the log. Common phrases would also be logged unless the bot were programmed to minimize this, maybe by centering each search string on a period (thus spanning parts of two sentences).
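The period-centering idea could be sketched roughly as follows. The function name and the five-words-per-side default are assumptions for illustration, and the sentence detection is deliberately naive (any word ending in "." counts as a sentence end, so abbreviations would produce false boundaries).

```python
def period_centred_strings(text, words_each_side=5):
    """Return search strings that straddle a sentence boundary.

    Each string holds `words_each_side` words ending at a period plus the
    same number of words starting the next sentence, so a stock phrase
    inside a single sentence is less likely to match on its own.
    """
    words = text.split()
    strings = []
    # The final word is skipped: nothing follows it to span into.
    for i, word in enumerate(words[:-1]):
        if word.endswith("."):
            start = i + 1 - words_each_side
            end = i + 1 + words_each_side
            # Keep only boundaries with enough words on both sides.
            if start >= 0 and end <= len(words):
                strings.append(" ".join(words[start:end]))
    return strings
```

For example, a passage of two sentences with at least five words on each side of the period yields one string made of the tail of the first sentence and the head of the second.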
The above copyright violations happened several days ago and were obviously missed by the militia. Several users subsequently edited and expanded the offending text, which causes major version control headaches when obvious violations are still mixed in with legit contribs.
Having a bot do much of the work would free up human resources (this would also reduce duplication of effort -- sometimes several people will perform a Google test on a suspicious entry, while at other times nobody does).
Yes, I know this is an example of a bot that could be a good community member. I take back my previous comment (it all depends on what the bot does).
-- Daniel Mayer (aka mav)
wikipedia-l@lists.wikimedia.org