Hi all,
Today, I accidentally discovered a massive number of copyvios from Microsoft Encarta in the German Wikipedia, all by a single user (who unfortunately used an excessive number of sockpuppet accounts). For anyone interested, the list of detected copyvios is here:
* http://de.wikipedia.org/wiki/Benutzer:Peterlustig/Encarta_URVs
What we did was go through all the article contributions by this user manually, pick the larger text insertions, and then look up the entries in Encarta and other sources. We discovered a lot of copyvios this way, but since I don't even know all the sockpuppets this user had, we have no idea how many more are still in Wikipedia.
I wonder if we could use the toolserver for a good copyvio check system. Wikimedia Germany could easily sponsor Britannica, Encarta and Brockhaus DVDs, which would serve as a text base for comparison (if we manage to access the texts somehow).
It would be great if one or more people would like to work on this - apart from vandal-fighting tools, this should have top priority.
greetings, elian
On 10/22/05, Elisabeth Bauer <elian@djini.de> wrote:
> Today, I accidentally discovered a massive number of copyvios from Microsoft Encarta in the German Wikipedia, all by a single user (who unfortunately used an excessive number of sockpuppet accounts). For anyone interested, the list of detected copyvios is here:
> What we did was go through all the article contributions by this user manually, pick the larger text insertions, and then look up the entries in Encarta and other sources. We discovered a lot of copyvios this way, but since I don't even know all the sockpuppets this user had, we have no idea how many more are still in Wikipedia.
> I wonder if we could use the toolserver for a good copyvio check system. Wikimedia Germany could easily sponsor Britannica, Encarta and Brockhaus DVDs, which would serve as a text base for comparison (if we manage to access the texts somehow).
> It would be great if one or more people would like to work on this - apart from vandal-fighting tools, this should have top priority.
One of the little techniques I've used in my text classification tools here is a rolling hash. You take all of the Wikipedia article text, strip out markup, and canonicalize the text (squash whitespace, simplify punctuation, flatten case), then grab four contiguous words, compute a hash (I was using the first 32 bits of an MD5, but something faster could be used), store it, slide the window over by one word, and repeat.
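Roughly, in Python, it would look something like this (just a sketch - the helper names and the crude markup stripping are placeholders, not the actual code I used):

    import hashlib
    import re

    def canonicalize(text):
        # Crude markup removal plus case/punctuation/whitespace flattening.
        text = re.sub(r"<[^>]+>|\[\[|\]\]|'{2,}", " ", text)
        text = re.sub(r"[^\w\s]", " ", text.lower())
        return re.sub(r"\s+", " ", text).strip()

    def shingle_hashes(text, window=4):
        # One 32-bit value per window position: the first 32 bits of the
        # MD5 of each run of four contiguous words.
        words = canonicalize(text).split()
        for i in range(len(words) - window + 1):
            chunk = " ".join(words[i:i + window]).encode("utf-8")
            yield int.from_bytes(hashlib.md5(chunk).digest()[:4], "big")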
Eventually you get one 32-bit value for (roughly) every word in Wikipedia. Throw out the most frequent values, since they tend to just be common text. The raw data from this is only about 1 GB for en, as I recall.
Run the same process against your comparison/check corpus, then sort articles by the number of hits and compare the top candidates by hand.
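Again purely as an illustration, the comparison step could look like this (it assumes the shingle_hashes() helper above and plain title -> text dicts; the real storage layout on the toolserver would obviously have to be smarter than an in-memory Counter):

    from collections import Counter

    def build_index(wiki_articles, drop_top=10000):
        # Hash every Wikipedia article, then throw out the most frequent
        # values, which tend to be common phrasing rather than copied text.
        counts = Counter()
        for text in wiki_articles.values():
            counts.update(shingle_hashes(text))
        for h, _ in counts.most_common(drop_top):
            del counts[h]
        return set(counts)

    def rank_suspects(index, check_corpus):
        # For each comparison-corpus entry, count how many of its shingle
        # hashes also occur somewhere in Wikipedia, and rank by that count;
        # the highest scorers get checked by hand.
        hits = {title: sum(1 for h in shingle_hashes(text) if h in index)
                for title, text in check_corpus.items()}
        return sorted(hits.items(), key=lambda kv: kv[1], reverse=True)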