Someone came to me with an image they believed they had obtained from Commons, but they were unsure exactly where they had found it.
I was eventually able to locate the image, but it took a lot of work.
Had the image been deleted for copyright problems, finding it would have been nearly impossible, yet that is exactly the situation in which we most need to be able to find an image.
I keep a local database of image fingerprints (quantized color histograms) which could be used to locate images, but it didn't work in this case because the image was newer than our last image backup (which was last year). It might be useful, but it's not a complete solution.
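
For anyone unfamiliar with the fingerprint idea, here is a minimal sketch of a quantized color histogram. It is not the code behind the database mentioned above: the 2-bits-per-channel quantization, the L1 distance, and the function names `fingerprint` and `distance` are all assumptions made for illustration.

```python
from collections import Counter
from typing import Iterable, List, Sequence, Tuple

Pixel = Tuple[int, int, int]  # 8-bit (R, G, B)


def fingerprint(pixels: Iterable[Pixel], bits: int = 2) -> List[float]:
    """Return a normalized color histogram with (2**bits)**3 bins.

    Each channel is reduced to its top `bits` bits, so similar colors
    fall into the same bin (assumed scheme, for illustration only).
    """
    if not 1 <= bits <= 8:
        raise ValueError("bits must be between 1 and 8")
    shift = 8 - bits
    levels = 1 << bits
    counts: Counter = Counter()
    total = 0
    for r, g, b in pixels:
        # Combine the three quantized channels into a single bin index.
        index = ((r >> shift) * levels + (g >> shift)) * levels + (b >> shift)
        counts[index] += 1
        total += 1
    if total == 0:
        raise ValueError("image has no pixels")
    return [counts[i] / total for i in range(levels ** 3)]


def distance(a: Sequence[float], b: Sequence[float]) -> float:
    """L1 distance between two fingerprints: 0.0 means identical bins."""
    return sum(abs(x - y) for x, y in zip(a, b))
```

A lookup would compute the fingerprint of the unknown image and return the stored images with the smallest distance. Because the histogram ignores where the colors sit in the image, it tolerates rescaling but can also match unrelated images with similar palettes.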
I wanted to get input from the community on a couple of actions we might like to take to improve the situation in the future:
# On upload we could attach the URL under which the image was uploaded to the image in an EXIF tag. There are a great many EXIF tags defined and I'm sure we could find a fitting one. This would only work for JPEG files, but it would be easy to implement. As a separate topic, we should consider adding license data to our EXIF tags (I do it for my images, but we should perhaps do it more generally). This would be fairly easy to do and I don't think it would be controversial, although there would be some complexity with respect to image moves once we gain that ability in the future. Does anyone object to this?
# We could also add the same information to the PNG comments. Such use of PNG comments is non-standard, but I don't think it would break anything. Anyone have any thoughts on that?
# We could add some RDF tags to SVGs for the same purpose, although I think the PNG rasterizations of SVGs would be more important.
# Finally, something that might be somewhat controversial: I think it would be a good idea to add some text to the (raster) thumbnail image on the image page. My idea is that we would add an extra white area below the image, large enough to contain a line of text which mentions where the image came from. This would have a twofold benefit: 1) it would encourage people to use the full-resolution image for reuse, and 2) it would cause automated scraping processes which hit our image pages to preserve human-readable tracking information. Unlike a classic watermark, this addition could be removed by cropping. Technically this would require just a few more arguments to ImageMagick during thumbnailing, but we'd have to make a few other changes to handle smaller images and to treat the image page thumb differently from other same-sized thumbs.
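
To make the PNG suggestion in the list above concrete, here is a rough sketch that writes a source URL into a PNG `tEXt` chunk using only the Python standard library. It is an illustration, not a proposal for the actual upload code: the function names, the choice of the `Comment` keyword, and the example URL are all assumptions.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def make_chunk(chunk_type: bytes, data: bytes) -> bytes:
    """Build one PNG chunk: length, type, data, CRC over type + data."""
    crc = zlib.crc32(chunk_type + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + chunk_type + data + struct.pack(">I", crc)


def add_text_chunk(png: bytes, keyword: str, text: str) -> bytes:
    """Return a copy of `png` with a tEXt chunk inserted right after IHDR."""
    if not png.startswith(PNG_SIGNATURE):
        raise ValueError("not a PNG file")
    if not 1 <= len(keyword) <= 79:
        raise ValueError("tEXt keyword must be 1 to 79 characters")
    # IHDR is always the first chunk: 4 (length) + 4 (type) + 13 (data) + 4 (CRC).
    ihdr_end = len(PNG_SIGNATURE) + 25
    # tEXt payload is Latin-1: keyword, a NUL separator, then the text.
    payload = keyword.encode("latin-1") + b"\x00" + text.encode("latin-1")
    return png[:ihdr_end] + make_chunk(b"tEXt", payload) + png[ihdr_end:]


def read_text_chunks(png: bytes) -> dict:
    """Collect all tEXt chunks as a {keyword: text} dictionary."""
    found = {}
    pos = len(PNG_SIGNATURE)
    while pos + 8 <= len(png):
        (length,) = struct.unpack(">I", png[pos:pos + 4])
        chunk_type = png[pos + 4:pos + 8]
        if chunk_type == b"tEXt":
            key, _, value = png[pos + 8:pos + 8 + length].partition(b"\x00")
            found[key.decode("latin-1")] = value.decode("latin-1")
        pos += 12 + length  # length field + type + data + CRC
    return found
```

Usage would look something like `tagged = add_text_chunk(data, "Comment", "https://example.org/wiki/File:Example.png")`, where the URL is a placeholder. Decoders skip text chunks they do not care about, so the pixel data is untouched and only the file size grows by a few dozen bytes.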
In general I think we need to think about how to push our metadata into the image files themselves. Only if the metadata is embedded in the images will downstream users have a hope of keeping track of them. If the rest of the world did this our lives would be much easier, so let us do unto others as we would have others do unto us.