I am using Wikimedia APIs to create a gallery of duplicates and routinely clean them. You can see the results here.

https://commons.wikimedia.org/wiki/User:Sreejithk2000/Duplicates

The page also has a link to the script. If anyone is interested in using this script, let me know and I can work with you to customize it.
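
For anyone wondering what such a lookup involves: MediaWiki's API has
a standard duplicatefiles property that returns every file sharing a
checksum with a given title. A minimal Python sketch (this is not
Sreejith's script, and the file name is only a placeholder):

    import requests

    API = "https://commons.wikimedia.org/w/api.php"
    HEADERS = {"User-Agent": "duplicate-gallery-sketch/0.1"}  # WMF asks for a UA

    def list_duplicates(title):
        # Standard prop=duplicatefiles module: yields the names of files
        # that are byte-for-byte duplicates of `title`.
        params = {
            "action": "query",
            "prop": "duplicatefiles",
            "titles": title,
            "dflimit": "max",
            "format": "json",
        }
        data = requests.get(API, params=params, headers=HEADERS).json()
        for page in data["query"]["pages"].values():
            for dup in page.get("duplicatefiles", []):
                yield dup["name"]

    # Placeholder title, purely illustrative:
    for name in list_duplicates("File:Example.jpg"):
        print(name)

For bulk sweeps, the database route Fæ describes below is much faster
than paging through the API one title at a time.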

- Sreejith K.


On Thu, Dec 4, 2014 at 2:46 PM, Fæ <faewik@gmail.com> wrote:
On 4 December 2014 at 18:39, Jonas Öberg <jonas@commonsmachinery.se> wrote:
>> * byte-for-byte identical
>
> That's something probably best done by WMF staff themselves; I think a
> simple md5 comparison would give quite a few matches. Doing it on the
> WMF side would alleviate the need to transfer large amounts of data.
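
For illustration, the byte-for-byte check suggested above is just
grouping files by digest; a minimal local sketch (assuming copies of
the files are already on disk, which is the transfer cost being
discussed):

    import hashlib
    from collections import defaultdict
    from pathlib import Path

    def md5_of(path, chunk=1 << 20):
        # Hash in 1 MiB chunks so large media files don't fill memory.
        h = hashlib.md5()
        with open(path, "rb") as f:
            for block in iter(lambda: f.read(chunk), b""):
                h.update(block)
        return h.hexdigest()

    def duplicate_groups(root):
        # Map digest -> paths; any group of 2+ is byte-for-byte identical.
        groups = defaultdict(list)
        for p in Path(root).rglob("*"):
            if p.is_file():
                groups[md5_of(p)].append(p)
        return {h: ps for h, ps in groups.items() if len(ps) > 1}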

Volunteers can do this using simple database queries, which is a lot
more efficient than pulling data out of the API. For example, while
writing this email I knocked out a query to show all non-trivial
images (more than 2 pixels wide) on Commons where at least *3* files
share the same SHA1 checksum, with each image shown just once. The
matching files are listed at the bottom of each image page on Commons.
Interestingly, this shows that most of the 226 files come from an
upload of Gospel illustrations. The low number seems reassuring
considering the size of Commons. The files are reported in descending
order of image resolution.

Report: http://commons.wikimedia.org/w/index.php?title=User:F%C3%A6/sandbox&oldid=141460887
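
The SQL itself isn't shown in the report, but a query along these
lines against the Commons database replica would reproduce the idea
(the Toolforge host and credentials file are assumptions; img_sha1,
img_width, img_height and img_name are standard MediaWiki image-table
columns):

    import pymysql

    # Hypothetical Toolforge replica connection; adjust for your setup.
    conn = pymysql.connect(
        host="commonswiki.analytics.db.svc.wikimedia.cloud",
        database="commonswiki_p",
        read_default_file="~/replica.my.cnf",
    )

    # One row per checksum with 3+ identical non-trivial files,
    # one representative name per group, largest image first.
    QUERY = """
        SELECT img_sha1,
               COUNT(*)      AS copies,
               MIN(img_name) AS example,
               MAX(img_width * img_height) AS pixels
        FROM image
        WHERE img_width > 2
        GROUP BY img_sha1
        HAVING COUNT(*) >= 3
        ORDER BY pixels DESC
    """

    with conn.cursor() as cur:
        cur.execute(QUERY)
        for sha1, copies, example, pixels in cur.fetchall():
            print(copies, example)

Relaxing HAVING COUNT(*) >= 3 to >= 2 gives the longer 2-or-more
listing mentioned below.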

On its own this is an interesting list to use as a backlog for fixes.
Listing identical duplicates with 2 or more matching files would be
simpler but longer; at the moment I count 3,279 such files on Commons,
from a query that took over 9 minutes to run. :-)

Fae
--
faewik@gmail.com https://commons.wikimedia.org/wiki/User:Fae
