In light of the editor retention problem, I suggest we have to be very careful with any kind of “plagiarism detector” software, because we have real subject matter experts among our editors. I’m aware of members of local history societies who have had issues with copyright violation because they have content on their own websites which they then contribute to Wikipedia. It’s not a copyright violation because it’s their own work, but it was deleted and they were accused of copyright violation, and they were naturally very unhappy about both. Being new users, they did not know any way to get this redressed, so they asked me for help. I got nowhere with the editor who deleted the material, who would not accept their assertion that they were the original authors (how on earth could they prove it?). As a result, none of them are now active editors.

Having had a whole bunch of my own images nearly deleted from Commons because they appear on my own website (despite my user name being my real name, and my real name being all over my website), I know how it feels to have accusations of copyright violation all over your contributions. It’s really offensive. Strangely, we have no way to whitelist particular websites in relation to particular users (in theory, you’d want to be able to whitelist books and off-line resources too, but in practice “copies” from these are far less likely to be noticed), so the same problem can arise again and again for an individual contributor.

So I would be very hesitant about putting any visible tag on an article suggesting it is a copyright violation (it seems to me both offensive and potentially libellous to an editor who has contributed their own work in good faith). I think any concern about copyright has to be raised first with the editor involved, as a question, NOT an accusation. And I note that it is often very difficult to communicate with new or occasional editors, as they often have no email address associated with their account and they don’t see talk page message banners unless they are logged in (for example, via “remember me”). It’s ironic that at the very time a contributor is most likely to want or need help, we are in the worst position to know they want it, or to offer it if we see they need it.

So, I’m with Jane on this one. It’s easy enough to detect a lot of potential copyright violations automatically. What’s hard, and very much a manual task, is confirming that it really is a copyright violation and, where required, educating the contributor. I think there’s a real danger in automating the first part without a good solution to the second part. We already have far too many editors who use tools as weapons, so I am reluctant to give them more.

Kerry