[Wikimedia-l] Copyright infringement - The real elephant in the room

Tue Nov 19 01:07:48 UTC 2013

On 11/16/2013 09:04 AM, Anthony Cole wrote:
> The problem of false positives from mirrors doesn't exist if we scan edits
> as they are made.

Agreed.  However, that example is a legal, attributed (at least on the 
talk page) copy from a third-party freely licensed text, not a false 
positive copy from a Wikipedia mirror.

> Maggie says here
> <https://en.wikipedia.org/wiki/Wikipedia:Administrators%27_noticeboard#Emergency_block_of_an_editor_with_which_I_have_been_previously_involved>
> that copyright bots populate
> WP:SCV <https://en.wikipedia.org/wiki/Wikipedia:SCV>. So a
> similarly-configured bot could scan recent changes and tag suspected
> copyvios in watchlists and page histories like suspected vandalism is
> currently tagged.

The suspected vandalism checks that actually tag the edit (e.g. "Tag:
possible vandalism") are based on AbuseFilter checks.  These are
relatively fast determinations that consider the text of the edit (e.g.
regexes for strings of curse words, or meaningless repeating
characters) and comparisons to the previous version (e.g. the edit
blanked a section or blanked the page).
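
To illustrate the kind of cheap check described above, here is a
minimal Python sketch.  It is not AbuseFilter's own rule syntax, and
the word list, the threshold of ten repeated characters, and the
function name cheap_edit_flags are all assumptions made up for the
example.  Section blanking is left out to keep it short.

```python
import re
from typing import List

# Hypothetical stand-in word list; real filters maintain their own patterns.
CURSE_STRING = re.compile(
    r"\b(?:darn|heck)\b(?:\W+(?:darn|heck)\b)+", re.IGNORECASE
)
# The same character ten or more times in a row (assumed threshold).
REPEATED_CHARS = re.compile(r"(.)\1{9,}")


def cheap_edit_flags(old_text: str, new_text: str, added_text: str) -> List[str]:
    """Return labels for fast, text-only heuristics that match this edit."""
    flags = []
    # Checks on the text the edit added.
    if CURSE_STRING.search(added_text):
        flags.append("string of curse words")
    if REPEATED_CHARS.search(added_text):
        flags.append("repeating characters")
    # Check against the previous version: a non-empty page became empty.
    if old_text.strip() and not new_text.strip():
        flags.append("page blanked")
    return flags
```

Every check here looks only at the old and new text, with no database
or network access, which is why this sort of rule can run on every
edit.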

As far as I know, regular AbuseFilter rules cannot query a database or
run a web search to check for copyright violations.  An extension could
in theory do this.  But there could be performance problems, since
AbuseFilter runs on the production servers (not just some bot's
computer) on every edit.

It is possible for a bot to scan every edit; it just can't use 
AbuseFilter tags.
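
As a rough sketch of the bot approach, the Python below polls the
MediaWiki API's recentchanges list and collects the revisions a
copyright check would then need to examine.  The User-Agent string and
the function names are hypothetical, the copyright comparison itself is
not shown, and this is not how the existing WP:SCV bots are known to
work.

```python
import json
import urllib.parse
import urllib.request
from typing import List, Tuple

API_URL = "https://en.wikipedia.org/w/api.php"


def recent_changes_url(limit: int = 50) -> str:
    """Build a query URL for the most recent edits and page creations."""
    params = {
        "action": "query",
        "list": "recentchanges",
        "rctype": "edit|new",
        "rcprop": "title|ids|timestamp|user",
        "rclimit": str(limit),
        "format": "json",
    }
    return API_URL + "?" + urllib.parse.urlencode(params)


def revisions_to_check(response: dict) -> List[Tuple[str, int]]:
    """Pull (page title, revision id) pairs out of a parsed API response."""
    changes = response.get("query", {}).get("recentchanges", [])
    return [(change["title"], change["revid"]) for change in changes]


def fetch_recent_changes(limit: int = 50) -> dict:
    """Fetch one batch of recent changes (performs a network request)."""
    request = urllib.request.Request(
        recent_changes_url(limit),
        # Hypothetical bot name; a real bot should identify itself properly.
        headers={"User-Agent": "ExampleCopyvioBot/0.1 (hypothetical)"},
    )
    with urllib.request.urlopen(request) as reply:
        return json.load(reply)
```

A bot would call revisions_to_check(fetch_recent_changes()) in a loop
and run its slower comparison on each revision on its own machine, so
the cost stays off the servers that handle saving edits.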

Matt Flaschen