On 12/21/06, Fastfission fastfission@gmail.com wrote:
On 11/24/06, Earle Martin wikipedia@downlode.org wrote:
Whether the copyvio is an inward or outward bound one in each case is sadly beyond the scope of my programming skills, so I leave that to you.
I don't think this is a programming problem -- it's a conceptual problem.
A good copyvio bot -- one which doesn't waste one's time with false positives or outward copyvios -- would be one which monitors NEW additions and does not try to parse previously existing material. If someone says, "This is new, original text" but it gets Google hits, it is almost certainly copy-and-pasted (whether that makes it officially a copyvio still needs to be decided, but it is a vastly simpler problem than the previous one).
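To make the idea concrete, here is a rough Python sketch of checking only the newly added text of an edit, not of how any existing bot actually does it. The function names are made up for illustration, and `search_hits` is a hypothetical callable standing in for whatever exact-phrase web search one has access to; it is assumed to return the number of hits found outside Wikipedia and its mirrors. The sentence splitter is deliberately naive.

```python
import difflib
import re
from typing import Callable, List


def split_sentences(text: str) -> List[str]:
    # Naive splitter: break on whitespace that follows ., ! or ?
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def added_sentences(old_text: str, new_text: str, min_words: int = 8) -> List[str]:
    """Return sentences present in the new revision but not the old one.

    Short sentences are skipped because they match too easily by chance.
    """
    old = split_sentences(old_text)
    new = split_sentences(new_text)
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    added: List[str] = []
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        if tag in ("insert", "replace"):
            added.extend(s for s in new[j1:j2] if len(s.split()) >= min_words)
    return added


def flag_suspect_additions(
    old_text: str,
    new_text: str,
    search_hits: Callable[[str], int],  # hypothetical: exact-phrase hit count
) -> List[str]:
    """Return newly added sentences that already appear elsewhere on the web."""
    return [s for s in added_sentences(old_text, new_text) if search_hits(s) > 0]
```

Because only the diff between two revisions is searched, text that was already in the article (and may have been copied outward by mirrors) is never queried, which is the point of looking at new additions only. A human would still need to review whatever gets flagged.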
This is already being done.
Trying to go through the entire database by finding random pages and taking random lines seems extremely hit-and-miss to me, and if you have to worry about mirrors and false positives then I can't see how that would possibly be productive. The odds of finding a copyvio are going to be quite low, and the amount of time needed to sort through them is going to be quite high.
Daniel Brandt managed it.