Just casually clicking through some of the results you pulled, I can see at a glance that many of these duplicate uploads happen because the oldest version is virtually "unfindable" for Wikimedians; i.e. it is not in any category whatsoever.

On Thu, Dec 4, 2014 at 7:39 PM, Jonas Öberg <jonas@commonsmachinery.se> wrote:
Hi James,

> * byte-for-byte identical

That's probably something best done by WMF staff themselves; I think a
simple MD5 comparison would give quite a few matches. Doing it on the
WMF side would also avoid the need to transfer large amounts of data.
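
To make the idea concrete, here is a minimal sketch of that server-side
comparison: hash every file with MD5 and report any digest shared by more
than one file. The root path is a placeholder, not the actual location of
the media store.

import hashlib
from collections import defaultdict
from pathlib import Path

def md5_of(path: Path) -> str:
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    h = hashlib.md5()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def find_exact_duplicates(root: Path) -> dict[str, list[Path]]:
    """Map each MD5 digest to the files sharing it; keep only collisions."""
    groups: dict[str, list[Path]] = defaultdict(list)
    for path in root.rglob("*"):
        if path.is_file():
            groups[md5_of(path)].append(path)
    return {digest: paths for digest, paths in groups.items() if len(paths) > 1}

if __name__ == "__main__":
    # "/srv/commons/media" is a hypothetical path for illustration only.
    for digest, paths in find_exact_duplicates(Path("/srv/commons/media")).items():
        print(digest, [str(p) for p in paths])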

For the rest, that only requires a few API lookups to get the relevant
information (size etc.). I can also imagine that it might be useful to
take the results we've gotten and apply some secondary matching to the
pairs we've identified. Such a secondary matching could be more specific
than ours, narrowing the set down to true duplicates, and could also
take size into consideration.
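
As a sketch of what that secondary matching could look like, one can
query the public Commons API (action=query, prop=imageinfo) for size,
dimensions and SHA-1 of each file in a candidate pair and keep only the
pairs whose metadata agrees. The file titles below are placeholders, not
actual matches from our data.

import requests

API = "https://commons.wikimedia.org/w/api.php"

def image_info(title: str) -> dict:
    """Return size, dimensions and SHA-1 for a single File: page on Commons."""
    params = {
        "action": "query",
        "titles": title,
        "prop": "imageinfo",
        "iiprop": "size|sha1",
        "format": "json",
    }
    pages = requests.get(API, params=params, timeout=30).json()["query"]["pages"]
    page = next(iter(pages.values()))
    return page["imageinfo"][0]

def likely_true_duplicate(title_a: str, title_b: str) -> bool:
    """Identical SHA-1 means byte-for-byte identical; identical dimensions
    and byte size are a weaker but still useful signal."""
    a, b = image_info(title_a), image_info(title_b)
    if a["sha1"] == b["sha1"]:
        return True
    return (a["width"], a["height"], a["size"]) == (b["width"], b["height"], b["size"])

# Placeholder titles for illustration:
print(likely_true_duplicate("File:Example.jpg", "File:Example_copy.jpg"))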

That's beyond our needs, though: we're happy with the information we
have, and while it would contribute to our work to eliminate
duplicates on Commons, it's not critical right now. But if someone is
interested in working with our results or our data, we'd be happy to
collaborate on that if it would benefit Commons.

Sincerely,
Jonas

_______________________________________________
Commons-l mailing list
Commons-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/commons-l