On 04.12.2014 19:39, Jonas Öberg wrote:
> Hi James,
>
> * byte-for-byte identical
>
> That's something probably best done by WMF staff themselves; I think a
> simple md5 comparison would give quite a few matches. Doing it on the WMF
> side would alleviate the need to transfer large amounts of data.

This is happening automatically: the SHA1 hash of every file is computed on
upload and stored in the img_sha1 field in the database. I believe this is used
to warn users who try to upload an exact duplicate, but I'm not sure about
that. Anyway, *exact* duplicates can easily be found in the database by anyone
who has an account on toollabs. The relevant query is:

    SELECT A.img_name, A.img_sha1, B.img_name
    FROM image AS A
    JOIN image AS B
      ON A.img_sha1 = B.img_sha1 AND A.img_name < B.img_name;

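For anyone without database access, the self-join above can be tried on a toy
table. This is only a sketch: the function name find_exact_duplicates and the
in-memory SQLite table are made up for illustration, and the hash is stored as
a plain hex digest, which may not match how the real img_sha1 column encodes it.

```python
import hashlib
import sqlite3


def find_exact_duplicates(files):
    """Return (name_a, sha1, name_b) for every pair of identical files.

    `files` maps a file name to its raw bytes. The table layout is a
    two-column stand-in for the real image table (assumption).
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE image (img_name TEXT PRIMARY KEY, img_sha1 TEXT)")
    conn.executemany(
        "INSERT INTO image VALUES (?, ?)",
        [(name, hashlib.sha1(data).hexdigest()) for name, data in files.items()],
    )
    # Same join as in the mail: equal hashes, and A.img_name < B.img_name
    # so that each pair is reported once and no file is paired with itself.
    rows = conn.execute(
        "SELECT A.img_name, A.img_sha1, B.img_name "
        "FROM image AS A JOIN image AS B "
        "ON A.img_sha1 = B.img_sha1 AND A.img_name < B.img_name "
        "ORDER BY A.img_name, B.img_name"
    ).fetchall()
    conn.close()
    return rows
```

Calling find_exact_duplicates({"a.jpg": b"x", "b.jpg": b"x", "c.jpg": b"y"})
should report the single pair a.jpg / b.jpg.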
Having a list of "effective" duplicates, such as the same image in slightly
different resolution or compression, would of course be very interesting.
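One common way to approach such "effective" duplicates is a perceptual hash.
The sketch below uses a simple average hash, which is my choice for
illustration and not something WMF is known to compute. The names
average_hash and hamming_distance are hypothetical, and the image is assumed
to be already decoded into a grid of grayscale values.

```python
def average_hash(pixels, hash_size=8):
    """Average hash of a grayscale image given as a list of rows of numbers.

    The image is reduced to hash_size x hash_size cells by block averaging;
    each cell contributes one bit: 1 if brighter than the mean cell, else 0.
    """
    height, width = len(pixels), len(pixels[0])
    cells = []
    for r in range(hash_size):
        r0 = r * height // hash_size
        r1 = max((r + 1) * height // hash_size, r0 + 1)
        for c in range(hash_size):
            c0 = c * width // hash_size
            c1 = max((c + 1) * width // hash_size, c0 + 1)
            block = [pixels[y][x] for y in range(r0, r1) for x in range(c0, c1)]
            cells.append(sum(block) / len(block))
    mean = sum(cells) / len(cells)
    bits = 0
    for value in cells:
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def hamming_distance(a, b):
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")
```

Two files whose hashes differ in only a few bits would be candidates for the
same image at a different resolution or compression level; the threshold has
to be tuned on real data.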
-- daniel