Hi, On behalf of the Amsterdam Museum I'm prepping a batch upload of about 300 images of their collection of paintings. They've made the selection.
I note that a number of their paintings have already been uploaded by individual Commonists here: https://commons.wikimedia.org/wiki/Category:Paintings_in_the_Amsterdam_Museu...
Question: Should I upload all images in "my" batch anyway, even though this risks duplicating images? Is there a best practice for cases like this?
Cheers, David Haskiya
David Haskiya Product Development Manager
T: +31 (0)70 314 0696 M: +31 (0)64 217 2542 E: david.haskiya@europeana.eu Skype: davidhaskiya
Europeana (http://www.europeana.eu/) makes Europe's culture available for all, across borders and generations and for creative re-use - follow how at #AllezCulture (http://bit.ly/17mnbL7)
Disclaimer: This email and any files transmitted with it are confidential and intended solely for the use of the individual or entity to whom they are addressed. If you have received this email in error please notify the system manager. If you are not the named addressee you should not disseminate, distribute or copy this email. Please notify the sender immediately by email if you have received this email by mistake and delete this email from your system.
I have had many issues around this in the past. If the images are the same quality/resolution, avoid duplicating what is currently on Commons. However, if your versions are, in your view, better quality, there is no problem uploading them, as they are not true duplicates. Digitally identical duplicates should be rejected automatically at upload, as the files have matching SHA-1 checksums.
For most of my batch uploads I run a check on whatever unique ID is suitable to see if there are matches. This is highly useful when re-running uploads, as checking for matching filenames or image-page text takes far less processing and data volume than downloading each image file to compute its SHA-1.
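For anyone new to this, the unique-ID check can be sketched against the standard MediaWiki search API; a minimal Python example (the `insource:` search syntax and namespace 6 for File: pages are standard on Commons, but treat this as a sketch rather than a tested tool):

```python
import json
import urllib.parse
import urllib.request

API = "https://commons.wikimedia.org/w/api.php"

def id_search_url(unique_id):
    # insource: matches the file-page wikitext, where a GLAM's accession
    # or unique ID usually appears; namespace 6 restricts to File: pages.
    params = urllib.parse.urlencode({
        "action": "query",
        "list": "search",
        "srsearch": 'insource:"%s"' % unique_id,
        "srnamespace": "6",
        "format": "json",
    })
    return "%s?%s" % (API, params)

def find_existing_uploads(unique_id):
    # Returns titles of File: pages already mentioning the ID -- far
    # cheaper than downloading images to compare SHA-1 values.
    with urllib.request.urlopen(id_search_url(unique_id)) as resp:
        data = json.load(resp)
    return [hit["title"] for hit in data["query"]["search"]]
```

A match on the ID does not prove the image is byte-identical, only that the work is probably already represented, so hits still need a manual look.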
PS this feels like "advanced class" techniques. Apologies if I'm a crappy teacher. :-)
Fae
My experience/practice was to upload everything, manually add the duplicates to the "other versions" field, and then curate the metadata among the images.
You may have multiple versions from different sources, e.g. Google Art or a museum's Flickr feed.
Some duplicates got deleted, but there is no process for image curation among duplicates. (Yet.)
I found in general that there were no TIFFs, only some high-res JPGs of around 20MB taken from TIFFs.
jim hayes
On May 2, 2014 5:40 AM, "Fæ" faewik@gmail.com wrote:
I have had many issues around this in the past. If the images are the
same in quality/resolution then avoid duplicating what is currently on Commons. However if your versions are, in your view, better quality then there is no problem uploading them as they are not true duplicates. Digitally identical duplicates should be rejected automatically at upload as the files have matching SHA-1 checks.
That's for a normal upload. GWToolset may be different. Anyway, they can be dealt with after the fact too: if files are exactly identical, they are easy to detect later.
On 03/05/2014, Brian Wolff bawolff@gmail.com wrote:
Just to clarify, does the GWT ignore the SHA-1 based duplicate warning and upload the digitally identical duplicate as a new file?
If it does, rather than skipping it or giving a warning, then this seems like a bug.
Fae
GWToolset ignores the SHA-1 duplication warning. As far as I remember, the intent is to make sure the source of the media file and metadata is from the GLAM.
with kind regards, dan
Hi Dan,
It probably should still output a warning and list all identical files, so they can be tackled manually after the upload. Giving preference to the media file from the GLAM probably makes sense, but you still want to substitute any other identical files, right?
When manually tackling identical files, the following potential issues should be looked at:
- Existing files may already be included in Wikipedia articles - it probably would make sense to replace them with the newly uploaded version.
- Metadata of existing files may be more complete than, or complementary to, the metadata provided by the GLAM, especially if it has been enhanced by the community (translations, etc.) - it certainly would make sense not to throw away this additional metadata contributed by the community.
- There might be derivatives based on existing files - it certainly makes sense to ensure that these can be properly traced back to the original file.
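The first point, checking whether an existing file is already embedded in articles, can be scripted via the GlobalUsage extension's API on Commons; a rough Python sketch (parameter names as I recall them from the API, so verify before relying on it):

```python
import json
import urllib.parse
import urllib.request

API = "https://commons.wikimedia.org/w/api.php"

def usage_query_url(file_title):
    # prop=globalusage lists every page embedding the file,
    # across all Wikimedia projects.
    params = urllib.parse.urlencode({
        "action": "query",
        "titles": file_title,
        "prop": "globalusage",
        "gulimit": "500",
        "format": "json",
    })
    return "%s?%s" % (API, params)

def pages_using(file_title):
    # Returns (wiki, page title) pairs, so a duplicate's usage can be
    # reviewed before deciding which copy to keep or replace.
    with urllib.request.urlopen(usage_query_url(file_title)) as resp:
        data = json.load(resp)
    usage = []
    for page in data["query"]["pages"].values():
        for entry in page.get("globalusage", []):
            usage.append((entry["wiki"], entry["title"]))
    return usage
```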
This may not be complete; maybe someone actively involved in uploads that have encountered the problem of such duplicates wants to go through the list, complement it and add it to the help/documentation pages...
Have a nice weekend!
Beat
On 03/05/2014, Estermann Beat beat.estermann@bfh.ch wrote:
All good points. I think this is a good bug to document.
I would like to add an issue I have experienced with my batch uploads from the US Department of Defense (c. 40,000 photographs to date) and the Imperial War Museums (c. 60,000 images?). In both cases the source website "refreshes" images with new versions under the same link and unique identity. This means the SHA-1 changes over time, sometimes with just the EXIF data changing. An error I used to make on these mass uploads was to rely on the SHA-1 as the means to identify duplicates. The complexity of this problem means I doubt there is one fixed solution that fits all GLAMs, making the reporting of uploaded duplicates an important feature. The upload behaviour (giving the user the option to overwrite or create duplicates) is probably also a feature that needs improvement, to avoid a lot of time-consuming post-upload housekeeping, along with the inevitable heated volunteer complaints. :-)
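One way around EXIF-only "refreshes" is to hash just the image segments of a JPEG, skipping the APPn/COM metadata blocks; a pure-Python sketch (my own illustration, not something any of the tools currently do, and it assumes a well-formed baseline JPEG):

```python
import hashlib
import struct

def jpeg_content_sha1(data: bytes) -> str:
    """SHA-1 of a JPEG's image data, ignoring APPn/COM metadata segments.

    Two files that differ only in EXIF (an APP1 segment) hash identically,
    so a source image whose metadata was re-stamped can still be matched."""
    assert data[:2] == b"\xff\xd8", "not a JPEG (missing SOI marker)"
    h = hashlib.sha1()
    i = 2
    while i < len(data):
        assert data[i] == 0xFF, "bad marker byte"
        marker = data[i + 1]
        if marker == 0xDA:  # SOS: entropy-coded image data follows
            h.update(data[i:])
            break
        # Segment length is big-endian and includes the two length bytes.
        length = struct.unpack(">H", data[i + 2:i + 4])[0]
        segment = data[i:i + 2 + length]
        if not (0xE0 <= marker <= 0xEF or marker == 0xFE):
            h.update(segment)  # keep frame/quantisation/Huffman segments
        i += 2 + length
    return h.hexdigest()
```

This would only catch metadata-level drift; if the source re-encodes the pixels, no hash of the bytes will match and a visual comparison is the only option.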
As an example of an awkward long term backlog that is part of my legacy of uploads, I have over 500 photographs that I still have to review by hand at: https://commons.wikimedia.org/wiki/Category:Images_from_DoD_uploaded_by_F%C3...
Fae
It probably should still output a warning and list all identical files, so they can be tackled manually after the upload. Giving preference to the media file from the GLAM probably makes sense, but you still want to substitute any other identical files, right?
Do I understand correctly that you suggest replacing volunteer-uploaded files with GLAM-uploaded files?
charles
Hi Dan,
dan entous schreef op 3-5-2014 10:02:
GWToolset ignores the SHA-1 duplication warning. as far as i remember, the intent is to make sure the source of the mediafile and metadata is from the GLAM.
Are you 100% sure about this? I'm pretty sure we discussed this as a hard requirement in one of the sprint sessions. Can someone please check/confirm this behaviour in the wild (on Commons)? The tool should skip SHA-1 duplicates by default unless explicitly ordered otherwise by the user.
Maarten
GWT should prevent the upload of duplicates. Not enough users are working on the backlog...
https://bugzilla.wikimedia.org/show_bug.cgi?id=64831
As an example where the current behaviour of GWToolset allowing duplicates is a problem: my NYPL uploads are very large files (up to ~300MB images), and unfortunately there are instances where the library gives multiple identities to the same scanned image. Three identical duplicates of a map as uploaded by GWT, two of which must be deleted at some point:
1. https://commons.wikimedia.org/wiki/File:Carta_dell%27_Egitto,_Sudan,_Mar_Ros...
2. https://commons.wikimedia.org/wiki/File:Carta_general_del_Oceano_Atlantico_%...
3. https://commons.wikimedia.org/wiki/File:Cartagena_NYPL1505044.tiff
The example file is 97MB, and to test for this duplicate via the API myself I would have to download the file locally, calculate the SHA-1, and then query the Commons API for possible duplicates. This assumes the EXIF data has not been changed. Considering the sizes of the files, and that this is a batch upload of more than 10,000 images, this is not practical and would in effect make the GWT irrelevant, as I could then upload my local copy without bothering to create an XML file and set up GWT.
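For smaller batches the workflow just described (hash locally, then ask Commons) is workable; `list=allimages` with `aisha1` is the standard way to look files up by content hash, though as noted it only catches byte-identical copies. A rough sketch:

```python
import hashlib
import json
import urllib.parse
import urllib.request

API = "https://commons.wikimedia.org/w/api.php"

def sha1_of_file(path):
    # Stream in 1MB chunks: files like the NYPL scans run to hundreds of MB.
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def sha1_lookup_url(sha1_hex):
    # list=allimages with aisha1 returns every Commons file with this hash.
    params = urllib.parse.urlencode({
        "action": "query",
        "list": "allimages",
        "aisha1": sha1_hex,
        "format": "json",
    })
    return "%s?%s" % (API, params)

def commons_duplicates(path):
    with urllib.request.urlopen(sha1_lookup_url(sha1_of_file(path))) as resp:
        data = json.load(resp)
    return [img["name"] for img in data["query"]["allimages"]]
```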
Other checks I run when preparing my XML, such as by filename and NYPL unique ID, cannot find these duplicates. I currently have no idea how many digitally identical duplicates the GWT has allowed into the NYPL uploads; this is now a longer-term post-upload housekeeping issue.
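To size up that backlog without downloading anything, duplicates already on Commons can be found by grouping a category's files by their server-side SHA-1 (`generator=categorymembers` combined with `prop=imageinfo&iiprop=sha1`; continuation/paging is omitted from this sketch):

```python
import collections
import json
import urllib.parse
import urllib.request

API = "https://commons.wikimedia.org/w/api.php"

def category_scan_url(category):
    # One request lists a category's files with their stored SHA-1s,
    # so no image ever needs downloading. (Continuation handling omitted.)
    params = urllib.parse.urlencode({
        "action": "query",
        "generator": "categorymembers",
        "gcmtitle": category,
        "gcmtype": "file",
        "gcmlimit": "500",
        "prop": "imageinfo",
        "iiprop": "sha1",
        "format": "json",
    })
    return "%s?%s" % (API, params)

def group_duplicates(pages):
    # pages: the API's "pages" mapping. Returns {sha1: [titles]} for any
    # hash shared by more than one file, i.e. byte-identical duplicates.
    groups = collections.defaultdict(list)
    for page in pages.values():
        info = page.get("imageinfo")
        if info:
            groups[info[0]["sha1"]].append(page["title"])
    return {s: t for s, t in groups.items() if len(t) > 1}

def duplicate_groups(category):
    with urllib.request.urlopen(category_scan_url(category)) as resp:
        data = json.load(resp)
    return group_duplicates(data.get("query", {}).get("pages", {}))
```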
Fæ