The message below was sent to the Board today. Would implementing some sort of automatic copyvio checker be feasible?
The second part of the email suggests it is too difficult to contact us about copyright violations. With the addition of the "contact us" link in the sidebar, I thought this would stop being a problem. Is there any other way of making it easier?
Angela.
---- Forwarded message ----
Regarding the continuing copyright issues caused by members who do not respect copyrights, I would recommend implementing something like what http://copyscape.com uses. From what I can tell, they use a Google API to search for text found on one page to see what other pages contain the same text. Using a similar approach, you could flag new pages that are substantially similar to pages already on the Internet for further review. While this wouldn't catch all copyright violations, it would go a long way towards making it easier to weed out blatant violations like the one I reported.
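For illustration, a minimal sketch in Python of the kind of check being described; the hit_count argument is a hypothetical stand-in for whatever exact-phrase search API would actually be used:

def looks_like_copyvio(article_text, hit_count, snippet_words=8):
    """Flag a page whose opening words already appear on the web.

    hit_count: a callable that takes an exact phrase and returns the
    number of web pages containing it (e.g. via a search API).
    """
    snippet = " ".join(article_text.split()[:snippet_words])
    return hit_count('"%s"' % snippet) > 0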
The issue of some individuals having absolutely no respect for copyrights and plagiarism is a serious problem that Wikipedia needs to address. Some people seem to think that Wikipedia is their personal means of bringing down copyright laws and "freeing" content. This is a shame, because these individuals threaten the long-term possibilities for Wikipedia.
On a related note, it should be easier to report copyright violations on the Wikipedia website. The current setup makes it tremendously burdensome to figure out how to report a violation. There needs to be a simple link from every page to a simple contact form that lets someone report a violation without any knowledge of how Wikipedia works. Doing this would put members on notice that Wikipedia isn't a rogue operation where anything goes, and that it takes copyright issues seriously.
On Monday 20 June 2005 21:57, Angela wrote:
> The message below was sent to the Board today. Would implementing some sort of automatic copyvio checker be feasible?
I have done something similar for the German Wikipedia:
http://www.itp.uni-hannover.de/~krohn/wscan.html.utf8
It reads all new pages from the German Wikipedia, shows the beginning of each text and some statistics (and guesses which links to other articles might be interesting). It also takes parts of some sentences and checks whether they appear somewhere on the Internet (by the way, runs of 5 to 6 consecutive words are almost unique).
Finally, the output is sorted by the number of hits ("Fundstellen"). I have several ideas for improving the script further (e.g. whitelists), but right now I do not have the time to do this.
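In case it helps to picture it, here is a minimal sketch of the checking step in Python (the hit_count callable is a placeholder for the Google API call; the real GPLed script does considerably more):

import re

def sample_phrases(text, length=6, samples=5):
    """Pick a few runs of consecutive words from the article text;
    as noted above, 5 to 6 consecutive words are almost unique."""
    words = re.findall(r"\w+", text)
    if len(words) <= length:
        return [" ".join(words)] if words else []
    step = max(1, (len(words) - length) // samples)
    return [" ".join(words[i:i + length])
            for i in range(0, len(words) - length, step)][:samples]

def rank_new_pages(pages, hit_count):
    """pages: list of (title, text) pairs; hit_count: a callable doing
    an exact-phrase web search. Returns (hits, title) pairs sorted by
    the total number of hits ("Fundstellen"), most suspicious first."""
    scored = [(sum(hit_count('"%s"' % p) for p in sample_phrases(text)), title)
              for title, text in pages]
    return sorted(scored, reverse=True)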
Nevertheless, if anyone is interested, I am glad to send them the GPLed source code (Python), and I can certainly give some advice.
best regards, Marco
P.S. Google was kind enough to extend my Google key to 7000 requests per day (the standard Google key only allows 1000 requests per day, which is not sufficient).
On 6/20/05, Angela <beesley@gmail.com> wrote:
> ---- Forwarded message ----
> Regarding the continuing copyright issues caused by members who do not respect copyrights, I would recommend implementing something like what http://copyscape.com uses. From what I can tell, they use a Google API to search for text found on one page to see what other pages contain the same text.
Well, the first problem would be the limit the Google API places on requests per day: it's 1000, and if it were used for the whole site, the number of requests would be a lot more than that.
Other than that, it's just a matter of coding it ;)
On Monday 20 June 2005 23:18, Ævar Arnfjörð Bjarmason wrote:
> Well, the first problem would be the limit the Google API places on requests per day: it's 1000, and if it were used for the whole site, the number of requests would be a lot more than that.
I asked Google for an extended key because my tool needs about 3000-4000 requests per day. After a short email explaining the reasons and asking for 7000 req/day, I got an extension within hours :-)
(7000 should be sufficient for checking the new articles for "en")
best regards, Marco
On 6/20/05, Marco Krohn <marco.krohn@web.de> wrote:
> On Monday 20 June 2005 23:18, Ævar Arnfjörð Bjarmason wrote:
> > Well, the first problem would be the limit the Google API places on requests per day: it's 1000, and if it were used for the whole site, the number of requests would be a lot more than that.
> I asked Google for an extended key because my tool needs about 3000-4000 requests per day. After a short email explaining the reasons and asking for 7000 req/day, I got an extension within hours :-)
> (7000 should be sufficient for checking the new articles for "en")
Maybe, but if you take into account that it has to be checked for all projects and languages, and that you may wish to make several requests per article (to check more than one string, say if you want to check each paragraph for likely copyvios), it grows quickly.
Not that I think that would be a showstopper when it comes down to it; I'm sure Google will increase their limit.
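A quick back-of-envelope calculation makes the point (all numbers here are made-up assumptions, not measured figures):

new_articles_per_day = {"en": 1500, "de": 500, "other projects": 2000}
checks_per_article = 3  # e.g. one sampled phrase per paragraph

total = sum(new_articles_per_day.values()) * checks_per_article
print(total)  # 12000 requests/day -- already well past a 7000/day key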
On Mon, 20 Jun 2005 21:57:37 +0200, Angela wrote:
> The second part of the email suggests it is too difficult to contact us about copyright violations. With the addition of the "contact us" link in the sidebar, I thought this would stop being a problem. Is there any other way of making it easier?
How about clarifying the policies so editors actually know what they are supposed to do? I've been harping on this on quite a few occasions: all clearly spelled-out policy is concerned with the case where "the most recent edit is a copyvio". What if the violation is only noticed several or many edits later?
I have looked for WP policy on this very issue, and like others who did the same, I could not find anything conclusive.
In my opinion, we must revert to the last version that was not a copyvio, and then salvage whatever we can from the later edits. However, in my experience, mine is a minority position, which is why I gave up on tracking down copyvios: what's the point if people reinstate the copyvio version and then change a couple of words to disguise what they did?
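To make the suggested policy concrete, here is a small Python sketch that walks an article's history backwards to find the newest revision not containing the copied text (the revision data is assumed to be simple (timestamp, text) pairs; hooking this into MediaWiki is left out):

def last_clean_revision(history, copied_text):
    """history: list of (timestamp, text) pairs, oldest first.
    Returns the newest revision that does not contain copied_text,
    or None if the article was a copyvio from its very first edit."""
    for timestamp, text in reversed(history):
        if copied_text not in text:
            return timestamp, text
    return None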
Roger