[WikiEN-l] Copyright Violation Bot

Fastfission fastfission at gmail.com
Thu Dec 21 15:13:52 UTC 2006


On 11/24/06, Earle Martin <wikipedia at downlode.org> wrote:
> Whether the copyvio is an inward or outward bound one in each case is
> sadly beyond the scope of my programming skills, so I leave that to
> you.

I don't think this is a programming problem -- it's a conceptual problem.

A good copyvio bot -- one which doesn't waste one's time with false
positives or outward copyvios -- would be one which monitors NEW
additions and does not try to parse previously existing material. If
someone says, "This is new, original text" but it gets Google hits, it
is almost certainly copy-and-pasted (whether that makes it officially
a copyvio still needs to be decided, but it is a vastly simpler
problem than the previous one).
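
To make the idea concrete, here is a minimal sketch in Python of the
"only look at new additions" approach. Everything named here is an
assumption for illustration: added_sentences, suspicious_additions and
MIN_WORDS are made-up names, and has_external_hits is a hypothetical
callable standing in for whatever search lookup one would actually use
(no real search API is assumed). It only flags candidates; a human still
has to decide whether a flagged sentence is really a copyvio.

```python
import difflib
import re

# Assumption: very short fragments match other pages by accident,
# so only sentences with at least this many words are checked.
MIN_WORDS = 8


def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' plus whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text)
    return [p.strip() for p in parts if p.strip()]


def added_sentences(old_text, new_text, min_words=MIN_WORDS):
    """Return sentences present in the new revision but not in the old one."""
    old_sents = split_sentences(old_text)
    new_sents = split_sentences(new_text)
    matcher = difflib.SequenceMatcher(a=old_sents, b=new_sents, autojunk=False)
    added = []
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        # 'insert' and 'replace' are the opcodes that introduce new material.
        if tag in ("insert", "replace"):
            added.extend(new_sents[j1:j2])
    return [s for s in added if len(s.split()) >= min_words]


def suspicious_additions(old_text, new_text, has_external_hits):
    """Flag newly added sentences for which the lookup reports a match.

    has_external_hits is a hypothetical callable (phrase -> bool); plug in
    whatever search backend is available.
    """
    return [s for s in added_sentences(old_text, new_text)
            if has_external_hits(s)]
```

Because only the difference between two revisions is examined, text
that was already in the article (and so may already have been copied by
mirrors) is never sent to the lookup at all.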

Trying to go through the entire database by finding random pages and
taking random lines seems extremely hit-and-miss to me, and if you
have to worry about mirrors and false positives then I can't see how
that would possibly be productive. The odds of finding a copyvio are
going to be quite low, and the amount of time needed to sort through
them is going to be quite high. Monitoring RC for copyvio seems much
simpler by comparison -- if finding previously-existing copyvios is
going to be an impossible effort to automate successfully (which I
think it is), preventing new copyvios would be comparatively easier.

(I started, ages ago, to work on a program which could do things like
this, but got bogged down and lacked time. Sigh...)

Just my two cents...
FF


