On 8/26/07, Andrew Garrett andrew@epstone.net wrote:
Because bots are an immediate solution that people can write within their comfort zone, whereas learning the MediaWiki codebase would be much more effort; they also think that you need to either grovel or be in an obscure clique of developers to get a patch applied. (They are wrong on both points.)
It was set up a while ago; at the time we didn't have a column for the SHA-1 data on non-deleted images. I saw someone setting up a bot to watch for naughty things in new uploads, decided I could help make it better, and shortly after I was able to give them a quick HTTP API to look up deleted edits by hash. So that was the initial reason.
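For the curious, here is a rough sketch of what that kind of lookup looks like from the bot's side. The endpoint URL, response format, and function name are just illustrative assumptions, not the actual service:

    import hashlib
    import urllib.request

    # Hypothetical endpoint; the real service's URL and output format may differ.
    LOOKUP_URL = "https://tools.example.org/deleted-image-lookup?sha1="

    def find_deleted_duplicates(path):
        """Hash a new upload and ask the service for deleted files with the same SHA-1."""
        with open(path, "rb") as f:
            digest = hashlib.sha1(f.read()).hexdigest()
        with urllib.request.urlopen(LOOKUP_URL + digest) as resp:
            body = resp.read().decode("utf-8").strip()
        # Assume one matching deleted title per line; an empty body means no match.
        return body.splitlines()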
And of course, it is more comfortable for a lot of people to code outside of MediaWiki. Laziness? Perhaps. But the changes we would have needed at the time included a schema change. Time to gratification inside MediaWiki: at least a couple of weeks. Externally, we're talking about minutes.
For workflow reasons it's actually much better to accomplish this via bots and tagging. Right now what we're doing is adding a tag like:
{{deletedimage|old image name|old deletion reason}}, which expands to a handy link to the history of the deleted duplicate.
Once someone has validated that the image is okay and doesn't need to be deleted again, they change the tag to {{deletedimage|old image name|..reason|verified=~~~~}} and the tag changes the category that the image is in.
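As a toy illustration of the tagging step (only the template syntax comes from the above; the helper name and parameters are mine):

    def deleted_duplicate_tag(old_name, old_reason, verified_by=None):
        """Build the {{deletedimage}} wikitext described above."""
        parts = ["deletedimage", old_name, old_reason]
        if verified_by:
            # Added by a human reviewer, normally as a signature (~~~~); the
            # template then puts the image in a different category.
            parts.append("verified=" + verified_by)
        return "{{" + "|".join(parts) + "}}"

    # What the bot adds:
    print(deleted_duplicate_tag("Old name.jpg", "copyvio"))
    # After human review:
    print(deleted_duplicate_tag("Old name.jpg", "copyvio", verified_by="~~~~"))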
We could trivially extend MediaWiki to display the links to deleted duplicates, and/or non-deleted duplicates, but extending MediaWiki to also participate in the workflow automation is a far more ambitious project.
It's also the case that we're doing more than just duplicate matching. For example, here are things which are already done or are being worked on:
* File type double-checking (currently offsetting bug 10823)
* Malicious content scanning (with ClamAV)
* Automated Google image lookup (Google image search the file name, grab the top couple of results and compare hashes)
* Image fingerprinting (to detect non-bit-identical duplicates)
* Suspect EXIF data (Corbis, Getty, AP copyright data in EXIF tags)
etc.
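To give a feel for how simple most of these checks are, here is a rough sketch of the suspect-EXIF one; the string watchlist and the use of PIL are my own assumptions, not necessarily what the bot actually does:

    from PIL import Image
    from PIL.ExifTags import TAGS

    # Illustrative watchlist of agency strings.
    SUSPECT_STRINGS = ("corbis", "getty", "associated press", "ap photo")

    def suspect_exif(path):
        """Return (tag name, value) pairs whose text mentions a known agency."""
        hits = []
        exif = Image.open(path).getexif()
        for tag_id, value in exif.items():
            if isinstance(value, bytes):
                value = value.decode("utf-8", "replace")
            text = str(value).lower()
            if any(s in text for s in SUSPECT_STRINGS):
                hits.append((TAGS.get(tag_id, str(tag_id)), value))
        return hits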
I'm not sure that putting all of that into MediaWiki makes sense. A lot of it works best asynchronously. A lot of it works best as part of a workflow where software and people work as peers, and we don't really have good ways for the MediaWiki software to participate in workflows today.
Even things like "this was deleted as a copyvio, don't upload it again" work best as a lagged process. Hard security would just result in the uploader figuring out that he can twiddle a single bit in the file and upload it again.
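To make that concrete: flipping a single bit gives a completely different SHA-1, so any hard block keyed on the exact hash of a deleted file is trivially evaded. (Toy sketch; the file name is just an example.)

    import hashlib

    original = open("some_upload.jpg", "rb").read()   # any non-empty file will do
    tweaked = bytearray(original)
    tweaked[-1] ^= 0x01                               # twiddle a single bit

    print(hashlib.sha1(original).hexdigest())
    print(hashlib.sha1(bytes(tweaked)).hexdigest())   # a completely different hash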