As an example of where GWTools' current behaviour of allowing duplicates is a problem: my NYPL uploads are very large files (images up to ~300MB), and unfortunately there are instances where the library has given multiple identities to the same scanned image. Here are three identical duplicates of a map as uploaded by GWT, two of which must be deleted at some point:
1. https://commons.wikimedia.org/wiki/File:Carta_dell%27_Egitto,_Sudan,_Mar_Ros...
2. https://commons.wikimedia.org/wiki/File:Carta_general_del_Oceano_Atlantico_%...
3. https://commons.wikimedia.org/wiki/File:Cartagena_NYPL1505044.tiff
The example file is 97MB. To test for this duplicate via the API myself, I would have to download the file locally, calculate its SHA-1, and then query the Commons API for matches, as sketched below. This only works if the EXIF data has not been changed, since altering any byte of the file changes the hash. Given the file sizes, and that this batch upload runs to more than 10,000 images, the approach is not practical, and it would in effect make GWT irrelevant: if I already had local copies of every file, I could upload them directly without bothering to create an XML file and set up GWT.
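For reference, a minimal Python sketch of that check, assuming the standard Commons API endpoint; the filename and chunk size here are illustrative, and the aisha1 parameter of list=allimages is what lets you look files up by hash:

    import hashlib
    import requests

    API = "https://commons.wikimedia.org/w/api.php"

    def sha1_of_file(path, chunk_size=1024 * 1024):
        """Compute the SHA-1 of a local file in chunks (files may be ~300MB)."""
        h = hashlib.sha1()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                h.update(chunk)
        return h.hexdigest()

    def commons_duplicates(sha1_hex):
        """Ask the Commons API for files whose bytes have an identical SHA-1."""
        params = {
            "action": "query",
            "list": "allimages",
            "aisha1": sha1_hex,
            "format": "json",
        }
        r = requests.get(API, params=params)
        r.raise_for_status()
        return [img["name"] for img in r.json()["query"]["allimages"]]

    # Any change to the file bytes (including edited EXIF) alters the SHA-1,
    # so this can only catch byte-identical duplicates.
    print(commons_duplicates(sha1_of_file("Cartagena_NYPL1505044.tiff")))

The expensive step is not the query but having to fetch each ~100MB file just to hash it, which is exactly what GWT was supposed to spare me.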
Other checks I run when preparing my XML, such as matching by filename and NYPL unique ID, cannot find these duplicates (sketch below), because the library has assigned each copy of the scan a different identity. I currently have no idea how many digitally identical duplicates GWT has allowed into the NYPL uploads; this is now a longer-term post-upload housekeeping issue.
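To illustrate the kind of pre-upload check I mean, and why it misses these cases, here is a sketch using the Commons search API; the NYPL ID is taken from example 3 above, and the helper name is my own:

    def search_files_for_id(nypl_id):
        """Search File: pages on Commons for an NYPL ID string."""
        params = {
            "action": "query",
            "list": "search",
            "srsearch": nypl_id,
            "srnamespace": "6",  # File: namespace
            "format": "json",
        }
        r = requests.get(API, params=params)
        r.raise_for_status()
        return [hit["title"] for hit in r.json()["query"]["search"]]

    # Finds files carrying this ID, but the other two copies of the same
    # scan carry different IDs and names, so they never show up.
    print(search_files_for_id("NYPL1505044"))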
Fæ