On 11/24/06, Earle Martin <wikipedia@downlode.org> wrote:
> Whether the copyvio is an inward or outward bound one in each case is sadly beyond the scope of my programming skills, so I leave that to you.
I don't think this is a programming problem -- it's a conceptual problem.
A good copyvio bot -- one which doesn't waste one's time with false positives or outward copyvios -- would be one which monitors NEW additions and does not try to parse previously existing material. If someone says, "This is new, original text" but it gets Google hits, it is almost certainly copy-and-pasted (whether that makes it officially a copyvio still needs to be decided, but that is a vastly simpler problem than the previous one).
Trying to go through the entire database by finding random pages and taking random lines seems extremely hit-and-miss to me, and if you have to worry about mirrors and false positives then I can't see how that would possibly be productive. The odds of finding a copyvio are going to be quite low, and the amount of time needed to sort through the results is going to be quite high. Monitoring recent changes (RC) for copyvio seems much simpler by comparison -- if finding previously-existing copyvios is going to be an impossible effort to automate successfully (which I think it is), preventing new copyvios would be comparatively easier.
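To make the "check only NEW additions" idea concrete, here is a rough Python sketch. It is an illustration under stated assumptions, not a working bot: the function names (added_sentences, suspicious_additions) and the search_hits callable are hypothetical, and search_hits stands in for whatever search-engine lookup one would actually use. The sketch diffs the old and new revision text, keeps only the sentences the edit introduced, and flags those that the search reports as already existing elsewhere.

```python
import difflib
import re
from typing import Callable, List


def _sentences(text: str) -> List[str]:
    # Naive sentence splitter: break after ., ! or ? followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def added_sentences(old_text: str, new_text: str, min_words: int = 8) -> List[str]:
    """Return sentences that appear in new_text but were not in old_text.

    Short sentences are skipped, since they match too much by chance.
    """
    old, new = _sentences(old_text), _sentences(new_text)
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    added: List[str] = []
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        if tag in ("insert", "replace"):
            added.extend(new[j1:j2])
    return [s for s in added if len(s.split()) >= min_words]


def suspicious_additions(
    old_text: str,
    new_text: str,
    search_hits: Callable[[str], int],  # hypothetical: number of exact-phrase hits
    min_words: int = 8,
) -> List[str]:
    """Return newly added sentences for which search_hits reports a match."""
    return [
        s for s in added_sentences(old_text, new_text, min_words)
        if search_hits(s) > 0
    ]
```

Because only freshly added text is checked, mirrors of the existing article are much less of a concern: a sentence that was written seconds ago has not had time to be copied outward. A human would still need to decide whether each flagged sentence is actually a copyvio.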
(I started, ages ago, to work on a program which could do things like this, but got bogged down and lacked time. Sigh...)
Just my two cents... FF