On 8/25/07, Brion Vibber <brion@wikimedia.org> wrote:
> Ok, the image metadata for sha-1 hashes is now updated when actually needed for deletion (or in a batch process) instead of on metadata read, which was what was bogging down the system.
Please set up a batch job to eventually populate the SHA-1 metadata for non-deleted images. We'd like to use it for duplicate image detection.
We're already doing this against the deleted images... A bot downloads the image, computes its SHA-1, checks the filearchive table for that hash, and if there is a match it complains in IRC and tags the new image.
This would be easier if we could skip the download-and-compute-SHA-1 step, and being able to test against non-deleted images would be handy too.
Of course, it doesn't detect anything that isn't bit-identical, but catching bit-identical duplicates is still useful.
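
For illustration only, here is a rough Python sketch of the kind of check described above. It is not the bot's actual code: the function names (sha1_hex, build_index, find_duplicates) are made up for this example, and a plain in-memory dict stands in for a lookup against the filearchive table.

```python
import hashlib
from typing import Dict, List


def sha1_hex(data: bytes) -> str:
    """Return the SHA-1 digest of the raw file bytes as a hex string."""
    return hashlib.sha1(data).hexdigest()


def build_index(files: Dict[str, bytes]) -> Dict[str, List[str]]:
    """Map each SHA-1 digest to the names of the files that have it.

    Hypothetical stand-in for stored hash metadata; a real bot would
    query the database rather than hold file contents in memory.
    """
    index: Dict[str, List[str]] = {}
    for name, data in files.items():
        index.setdefault(sha1_hex(data), []).append(name)
    return index


def find_duplicates(data: bytes, index: Dict[str, List[str]]) -> List[str]:
    """Return names of known files whose bytes hash to the same SHA-1."""
    return index.get(sha1_hex(data), [])


if __name__ == "__main__":
    # Made-up file names and contents, purely for demonstration.
    known = build_index({"Old.jpg": b"abc", "Other.png": b"xyz"})
    print(find_duplicates(b"abc", known))  # same bytes as Old.jpg
    print(find_duplicates(b"abd", known))  # one byte differs
```

The second lookup shows the limitation mentioned above: changing a single byte gives a different hash, so only bit-identical copies are reported.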