As an example of where the current behaviour of GWTools allowing
duplicates is a problem: my NYPL uploads are very large files (up to
~300MB images), and unfortunately there are instances where the
library has given multiple identities to the same scanned image:
Three identical duplicates of a map as uploaded by GWT, of which two
must be deleted at some point:
1.
https://commons.wikimedia.org/wiki/File:Carta_dell%27_Egitto,_Sudan,_Mar_Ro…
2.
https://commons.wikimedia.org/wiki/File:Carta_general_del_Oceano_Atlantico_…
3.
https://commons.wikimedia.org/wiki/File:Cartagena_NYPL1505044.tiff
The example file is 97MB. To test for this duplicate using the API
myself, I would have to download each file locally, calculate its
SHA-1, and then query the Commons API for possible duplicates; this
also assumes that the EXIF data has not been changed, since any
metadata edit alters the hash. Given the file sizes and that this is
a batch upload of more than 10,000 images, this is not practical, and
it would in effect make GWT irrelevant, as I could then upload my
local copy without bothering to create an xml and set up GWT.
The other checks I run when preparing my xml, such as matching on
filename and NYPL unique ID, cannot find these duplicates. I
currently have no idea how many digitally identical duplicates GWT
has allowed into the NYPL uploads; this is now a longer-term
post-upload housekeeping issue.
Fæ
--
faewik(a)gmail.com
https://commons.wikimedia.org/wiki/User:Fae