On 03/05/2014, Estermann Beat <beat.estermann(a)bfh.ch> wrote:
Hi Dan,
It probably should still output a warning and list all identical files, so
they can be tackled manually after the upload.
Giving preference to the media file from the GLAM probably makes sense, but
you still want to substitute any other identical files, right?
When manually tackling identical files, the following potential issues
should be looked at:
- Existing files may already be included in Wikipedia articles - it
probably would make sense to replace them with the newly uploaded version
- Metadata of existing files may be more complete or complementary to the
metadata provided by the GLAM, especially if it has been enhanced by the
community (translations, etc.) - it certainly would make sense not to throw
away these additional metadata that have been contributed by the community.
- There might be derivatives based on existing files - it certainly makes
sense to ensure that they can be properly tracked to the original file
This may not be complete; maybe someone actively involved in uploads who
has encountered the problem of such duplicates wants to go through the
list, complement it and add it to the help/documentation pages...
Have a nice weekend!
Beat
All good points. I think this is a good bug to document.
I would like to add an issue I have experienced with my batch uploads
from the US Department of Defense (c. 40,000 photographs to date) and
the Imperial War Museums (c. 60,000 images?); in both these cases the
source website "refreshes" the image with new versions under the
same link and unique identity. This means the SHA-1 is changing over
time, sometimes with just the EXIF data changing. An error I used to
make on these mass uploads was to rely on the SHA-1 as the means of
identifying duplicates. The complexity of this problem means that I doubt
there is one fixed solution that fits all GLAMs, which makes the reporting
of uploaded duplicates an important feature. The upload behaviour
(giving the user an option to overwrite or create duplicates) probably
also needs improvement, to avoid lots of time-consuming post-upload
housekeeping, along with the inevitable heated volunteer complaints. :-)
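
To make the SHA-1 problem concrete, here is a rough Python sketch of the kind of duplicate report I have in mind. It is not what any upload tool actually does; the MediaFile record, the source_id field and the classify() helper are all names I made up for illustration, and it assumes the GLAM gives each image a stable identifier that survives a "refresh". A file whose bytes match an existing file is reported as identical, while a file with a new SHA-1 but a known source identifier is reported as a refreshed version rather than silently treated as new.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class MediaFile:
    title: str      # hypothetical file title on the wiki
    source_id: str  # hypothetical stable identifier assigned by the GLAM
    data: bytes     # raw file contents, including any EXIF block


def sha1_hex(data: bytes) -> str:
    """Return the SHA-1 digest of the raw bytes as a hex string."""
    return hashlib.sha1(data).hexdigest()


def classify(incoming, existing):
    """Report each incoming file as 'identical', 'refreshed' or 'new'.

    'identical': same SHA-1 as at least one existing file.
    'refreshed': different SHA-1, but the source identifier is already known
                 (for example only the EXIF data changed at the source).
    'new':       neither the SHA-1 nor the source identifier is known.
    """
    by_hash = {}
    by_source = {}
    for f in existing:
        by_hash.setdefault(sha1_hex(f.data), []).append(f.title)
        by_source.setdefault(f.source_id, []).append(f.title)

    report = []
    for f in incoming:
        digest = sha1_hex(f.data)
        if digest in by_hash:
            report.append((f.title, "identical", by_hash[digest]))
        elif f.source_id in by_source:
            report.append((f.title, "refreshed", by_source[f.source_id]))
        else:
            report.append((f.title, "new", []))
    return report


# Made-up example data: one existing file, three incoming ones.
existing = [MediaFile("Old.jpg", "ID-001", b"pixels+exif-v1")]
incoming = [
    MediaFile("A.jpg", "ID-001", b"pixels+exif-v2"),  # same source, new bytes
    MediaFile("B.jpg", "ID-002", b"pixels+exif-v1"),  # same bytes, other source
    MediaFile("C.jpg", "ID-003", b"something else"),
]
for title, status, matches in classify(incoming, existing):
    print(title, status, matches)
```

Checking the hash first and the source identifier second is just one possible ordering. The point is only that the report lists every match so it can be reviewed by hand, rather than the tool deciding on its own whether to overwrite.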
As an example of an awkward long-term backlog that is part of my
legacy of uploads, I have over 500 photographs that I still have to
review by hand at:
https://commons.wikimedia.org/wiki/Category:Images_from_DoD_uploaded_by_F%C…
Fae
--
faewik(a)gmail.com
https://commons.wikimedia.org/wiki/User:Fae