On 03/05/2014, Estermann Beat beat.estermann@bfh.ch wrote:
Hi Dan,
It probably should still output a warning and list all identical files, so they can be tackled manually after the upload. Giving preference to the media file from the GLAM probably makes sense, but you still want to substitute any other identical files, right?
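As a rough illustration of "output a warning and list all identical files", here is a minimal Python sketch. It assumes the tool can get at the bytes (or stored SHA-1 values) of both the existing and the incoming files. The function name `find_identical` and the sample file names are made up for this example and do not come from any existing upload tool.

```python
import hashlib
from collections import defaultdict


def sha1_of(data: bytes) -> str:
    """Return the hex SHA-1 digest of a file's raw bytes."""
    return hashlib.sha1(data).hexdigest()


def find_identical(existing: dict, incoming: dict) -> dict:
    """Map each incoming file name to the existing names with identical bytes.

    Both arguments map a file name to the file's bytes (hypothetical inputs).
    Incoming files with no byte-identical match are left out of the result.
    """
    by_hash = defaultdict(list)
    for name, data in existing.items():
        by_hash[sha1_of(data)].append(name)

    report = {}
    for name, data in incoming.items():
        matches = by_hash.get(sha1_of(data), [])
        if matches:
            report[name] = sorted(matches)
    return report


# Made-up sample data: two existing files share the bytes of one incoming file.
existing = {
    "Old scan.jpg": b"\xff\xd8 pixels A",
    "Other.jpg": b"\xff\xd8 pixels B",
    "Copy of scan.jpg": b"\xff\xd8 pixels A",
}
incoming = {
    "GLAM 0001.jpg": b"\xff\xd8 pixels A",
    "GLAM 0002.jpg": b"\xff\xd8 pixels C",
}

report = find_identical(existing, incoming)
for name, matches in report.items():
    print(f"WARNING: {name} is identical to: {', '.join(matches)}")
```

The report only lists the matches; deciding which copy to keep or substitute is left to whoever follows up after the upload.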
When manually tackling identical files, the following potential issues should be looked at:
- Existing files may already be included in Wikipedia articles - it probably would make sense to replace them with the newly uploaded version.
- Metadata of existing files may be more complete than, or complementary to, the metadata provided by the GLAM, especially if it has been enhanced by the community (translations, etc.) - it certainly would make sense not to throw away the additional metadata that the community has contributed.
- There might be derivatives based on existing files - it certainly makes sense to ensure that they can be properly traced back to the original file.
This list may not be complete; maybe someone who is actively involved in uploads and has encountered the problem of such duplicates would like to go through the list, complement it and add it to the help/documentation pages...
Have a nice weekend!
Beat
All good points. I think this is a good bug to document.
I would like to add an issue I have experienced with my batch uploads from the US Department of Defense (c. 40,000 photographs to date) and the Imperial War Museums (c. 60,000 images?). In both cases the source website "refreshes" images with new versions under the same link and unique identifier. This means the SHA-1 changes over time, sometimes when only the EXIF data has changed. An error I used to make on these mass uploads was to rely on the SHA-1 as the means to identify duplicates.
The complexity of this problem means that I doubt there is one fixed solution that fits all GLAMs. That makes the reporting of uploaded duplicates an important feature, and the upload behaviour (giving the user an option to overwrite or create duplicates) probably needs improvement to avoid lots of time-consuming post-upload housekeeping, along with the inevitable heated volunteer complaints. :-)
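To make the SHA-1 pitfall concrete, here is a minimal Python sketch of one possible approach, not a description of how any existing tool works. It assumes the source site exposes a stable identifier per image and that the uploader keeps a record of the SHA-1 of each version already uploaded. The function `classify`, the identifier "DOD-0001" and the byte strings are all invented for illustration.

```python
import hashlib


def classify(source_id: str, data: bytes, uploaded: dict) -> str:
    """Classify a downloaded file against a record of earlier uploads.

    `uploaded` maps a source identifier to the SHA-1 of the version that was
    uploaded before (a hypothetical bookkeeping structure).
    """
    digest = hashlib.sha1(data).hexdigest()
    if source_id not in uploaded:
        return "new"        # identifier never seen: a genuinely new image
    if uploaded[source_id] == digest:
        return "unchanged"  # same identifier, same bytes: skip
    return "refreshed"      # same identifier, different bytes: report it


# Stand-ins for real files: same picture, different embedded metadata.
version_1 = b"pixel data" + b"EXIF: edited 2013"
version_2 = b"pixel data" + b"EXIF: edited 2014"

uploaded = {"DOD-0001": hashlib.sha1(version_1).hexdigest()}

status_same = classify("DOD-0001", version_1, uploaded)
status_refreshed = classify("DOD-0001", version_2, uploaded)
status_new = classify("DOD-0002", version_1, uploaded)
print(status_same, status_refreshed, status_new)
```

A comparison based on SHA-1 alone would treat the refreshed file as a brand new image. Keying on the source identifier at least lets the tool report it, so the user can choose between overwriting and creating a duplicate.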
As an example of an awkward long-term backlog that is part of my legacy of uploads, I have over 500 photographs that I still have to review by hand at: https://commons.wikimedia.org/wiki/Category:Images_from_DoD_uploaded_by_F%C3...
Fae