On 10/22/05, Elisabeth Bauer <elian(a)djini.de> wrote:
Today, I accidentally discovered a massive amount of copyvios from
Microsoft Encarta in the German Wikipedia by a single user (who
unfortunately used an excessive number of sockpuppet accounts). For
anyone interested, the list of detected copyvios is here:
*
http://de.wikipedia.org/wiki/Benutzer:Peterlustig/Encarta_URVs
What we did was: go through all the article contributions by this user
manually, pick the bigger text insertions, and then look up the entries
in Encarta and other sources. We discovered a lot of copyvios this way,
but since I don't even know all the sockpuppets this user had, we have
no idea how many more are still in Wikipedia.
I wonder if we could use the toolserver for a good copyvio check system.
Wikimedia Germany could easily sponsor Britannica, Encarta and Brockhaus
DVDs, which would serve as a text base for comparison (if we manage to
access the texts somehow).
It would be great if one or more people would like to work on this;
apart from vandal-fighting tools, this should have top priority.
One of the little techniques I've used in my text classification tools
here is a rolling hash. You take all of the Wikipedia article text,
strip out the markup, and canonicalize the text (squash whitespace,
simplify punctuation, flatten case). Then grab four contiguous words,
compute a hash (I was using the first 32 bits of an MD5, but something
faster could be used), store it, slide the window over by one word, and
repeat.
Eventually you get one 32-bit value for every word in Wikipedia. Throw
out the most frequent values (since they tend to just be common text).
The raw data from this is only about 1 GB for en, as I recall.
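The windowed hashing described above could be sketched roughly like
this in Python (a minimal sketch; the function names and the exact
canonicalization rules are my own assumptions, not the original tool):

```python
import hashlib
import re

def canonicalize(text):
    # Flatten case, strip punctuation, squash runs of whitespace.
    text = text.lower()
    text = re.sub(r"[^\w\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

def shingle_hashes(text, n=4):
    """One 32-bit hash per n-word window (first 32 bits of an MD5)."""
    words = canonicalize(text).split()
    hashes = []
    for i in range(len(words) - n + 1):
        window = " ".join(words[i:i + n])
        digest = hashlib.md5(window.encode("utf-8")).digest()
        # Take the first 4 bytes of the digest as a 32-bit integer.
        hashes.append(int.from_bytes(digest[:4], "big"))
    return hashes
```

Because canonicalization runs first, texts that differ only in case,
punctuation, or spacing produce identical hash sequences, which is the
point of the whole exercise.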
Run the same process against your comparison/check corpus. Then sort
articles by the number of hits and compare by hand.