As an example of where the current behaviour of GWTools allowing
duplicates is a problem: my NYPL uploads are very large files (up to
~300MB images), and unfortunately there are instances where the
library has given multiple identities to the same scanned image:
Three identical duplicates of a map as uploaded by GWT, of which two
must be deleted at some point:
1.
https://commons.wikimedia.org/wiki/File:Carta_dell%27_Egitto,_Sudan,_Mar_Ro…
2.
https://commons.wikimedia.org/wiki/File:Carta_general_del_Oceano_Atlantico_…
3.
https://commons.wikimedia.org/wiki/File:Cartagena_NYPL1505044.tiff
The example file is 97MB. To test for this duplicate using the API
myself, I would have to download each file locally, calculate its
SHA-1, and then query the Commons API for possible duplicates; this
also assumes that the EXIF data has not been changed, since any
metadata edit alters the hash. Given the file sizes and that this is
a batch upload of more than 10,000 images, this is not practical, and
it would in effect make GWT irrelevant, as I could then upload my
local copy without bothering to create an xml and set up GWT.
The other checks I run when preparing my xml, such as matching on
filename and NYPL unique ID, cannot find these duplicates. I
currently have no idea how many digitally identical duplicates GWT
has allowed into the NYPL uploads; this is now a longer-term
post-upload housekeeping issue.
Fæ
--
faewik(a)gmail.com
https://commons.wikimedia.org/wiki/User:Fae