On 8/26/07, Andrew Garrett <andrew(a)epstone.net> wrote:
> Because bots are an immediate solution that people can write within
> their comfort zone, whereas learning the MediaWiki codebase would be
> much more effort; as well as thinking that you need to either grovel,
> or be in an obscure clique of developers to have the patch applied.
> (They are wrong on both points).
It was set up a while ago; at the time we didn't have a column for the
SHA-1 data on non-deleted images. I saw someone setting up a bot to
watch for naughty things in new uploads, decided I could help make it
better, and shortly after I was able to give them a quick HTTP API to
look up deleted edits by hash. So that was the initial reason.
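As a rough sketch of what that lookup amounts to (the endpoint URL and parameter name here are made up for illustration; only the SHA-1 hashing step is from the above):

```python
import hashlib

def sha1_hex(data: bytes) -> str:
    # The hash used to match a fresh upload against deleted images.
    return hashlib.sha1(data).hexdigest()

def lookup_url(digest: str) -> str:
    # Hypothetical endpoint; the real API's URL and parameters differ.
    return "https://example.org/deleted-by-hash?sha1=" + digest

digest = sha1_hex(b"bytes of a new upload")
print(lookup_url(digest))
```

The bot only needs to hash the incoming file and hit one URL, which is why this was minutes of work externally.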
And of course, it is more comfortable for a lot of people to code
outside of MediaWiki. Laziness? Perhaps. But the changes we would
have needed at the time would have included a schema change. Time to
gratification inside MW: at least a couple of weeks. Externally, we're
talking about minutes.
For workflow reasons it's actually much better to accomplish this via
bots and tagging. Right now what we're doing is adding a tag like
{{deletedimage|old image name|old deletion reason}}, which expands to a
handy link to the history of the deleted duplicate.
Once someone has validated that the image is okay and doesn't need to
be deleted again, they change the tag to {{deletedimage|old image
name|..reason|verified=~~~~}}, and the tag changes the category that
the image is in.
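Concretely, the tag's lifecycle on an image page looks something like this (the image name and reason below are made up for illustration):

```
{{deletedimage|Example.jpg|Copyright violation}}

  ... after a human reviews the image ...

{{deletedimage|Example.jpg|Copyright violation|verified=~~~~}}
```

The verified= signature is what moves the image out of the needs-review category.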
We could trivially extend MediaWiki to display the links to deleted
duplicates, and/or non-deleted duplicates, but extending MediaWiki to
also participate in the workflow automation is a far more ambitious
project.
It's also the case that we're doing more than just duplicate matching.
For example, here are things which are already done or are being
worked on:
*File type double-checking (currently offsetting bug 10823)
*Malicious content scanning (with clamav)
*Automated google image lookup (google image search the file name,
grab the top couple results and compare hashes)
*Image fingerprinting (to detect non-bit-identical duplicates)
*Suspect EXIF data (Corbis, Getty, AP copyright data in EXIF tags).
etc.
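For the fingerprinting item, a minimal sketch of the idea (this is a generic average-hash, not necessarily the algorithm the bot uses, and the pixel data is made up):

```python
def average_hash(pixels):
    # pixels: 2D list of grayscale values (0-255). Each bit of the
    # fingerprint records whether that pixel is brighter than the mean,
    # so re-encoded or slightly edited copies still hash alike.
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    return [1 if p > mean else 0 for p in flat]

def hamming(a, b):
    # Small distance = probable duplicate, even if the bytes differ.
    return sum(x != y for x, y in zip(a, b))

img = [[10, 200], [220, 30]]
copy = [[12, 198], [225, 28]]   # slightly altered copy of img
print(hamming(average_hash(img), average_hash(copy)))  # -> 0
```

Exact SHA-1 matching misses the altered copy; the fingerprint distance of 0 still flags it.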
I'm not sure that putting all of that into MediaWiki makes sense. A
lot of it works best asynchronously. A lot of it works best as part of
a workflow where software and people work as peers, and we don't
really have good ways for the MediaWiki software to participate in
workflows today.
Even things like "this was deleted as a copyvio, don't upload it again"
work best as a lagged process. Hard security would just result in the
uploader figuring out that he can twiddle a single bit in the file and
upload it.
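The bit-twiddling defeat is easy to demonstrate: flipping a single bit of the file yields a completely different SHA-1, so exact-hash blocking no longer matches.

```python
import hashlib

original = b"bytes of a previously deleted image"
tampered = bytes([original[0] ^ 0x01]) + original[1:]  # flip one bit

print(hashlib.sha1(original).hexdigest() ==
      hashlib.sha1(tampered).hexdigest())  # -> False
```

This is why the lagged, human-in-the-loop process (plus fingerprinting) beats hard enforcement here.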