Image ID. Looking for community input. - Commons-l

4 Dec 2006

Someone came to me with an image they believed they had obtained from
Commons, but they were unsure of exactly where they had found it...

I was eventually able to locate the image, but it took a lot of work.

Had the image been deleted for copyright problems, finding it would
have been nearly impossible, yet that is exactly the sort of situation
in which we may need the ability to find an image the most.

I have a local database of image fingerprints (quantized color
histograms) which could be used to locate images... it didn't work in
this case because the image was newer than our last image backup
(which was made last year). It might be useful, but it's not a
complete solution.
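
To make the fingerprint idea concrete, here is a minimal sketch of a
quantized color histogram lookup. It is not the code behind the
database mentioned above: the 4 levels per channel, the L1 distance,
and the names fingerprint, distance and best_match are all assumptions
chosen for illustration, and pixels are assumed to arrive as plain
(r, g, b) tuples from whatever decoder is in use.

```python
from collections import Counter

LEVELS = 4  # assumption: 4 levels per 8-bit channel, i.e. 64 bins in total


def fingerprint(pixels, levels=LEVELS):
    """Return a normalized quantized color histogram.

    `pixels` is an iterable of (r, g, b) tuples with 8-bit channels.
    """
    counts = Counter()
    total = 0
    for r, g, b in pixels:
        # Map each 0..255 channel value onto 0..levels-1.
        rq, gq, bq = (r * levels // 256, g * levels // 256, b * levels // 256)
        counts[(rq * levels + gq) * levels + bq] += 1
        total += 1
    if total == 0:
        raise ValueError("image has no pixels")
    # Normalizing by pixel count lets a thumbnail match its full-size original.
    return [counts.get(i, 0) / total for i in range(levels ** 3)]


def distance(fp_a, fp_b):
    """L1 distance between fingerprints: 0.0 is identical, 2.0 is disjoint."""
    return sum(abs(a - b) for a, b in zip(fp_a, fp_b))


def best_match(query_fp, database):
    """Return the name in `database` (name -> fingerprint) closest to the query."""
    return min(database, key=lambda name: distance(query_fp, database[name]))
```

Because the histogram is normalized, a rescaled or lightly recompressed
copy should land close to its original, but a crop or a recolored
version may not, which is one reason this is not a complete solution.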

I wanted to get input from the community on a couple of actions we
might like to take to improve the situation in the future:

# On upload we could embed the URL the image was uploaded under in an
EXIF tag in the image itself.  There are a great many EXIF tags
defined and I'm sure we could find a fitting one. This would only work
for JPEG files, but it would be easy to implement.  As a separate
topic, we should consider adding license data to our EXIF tags (I do
it for my images, but we should perhaps do it more generally).  This
would be fairly easy to do and I don't think it would be
controversial, although there would be some complexity with respect to
image moves once we gain that ability in the future.  Does anyone
object to this?

# We could also add the same in the PNG comments... although such use
of PNG comments is non-standard, I don't think it would break
anything. Does anyone have any thoughts on that?

# We could add some RDF tags to SVGs for the same purpose, although I
think the PNG rasterizations of SVGs would be more important.

# Finally, something that might be somewhat controversial:   I think
it would be a good idea to add some text to the (raster) thumbnail
image on the image page. My idea is that we would add an extra white
area below the image large enough to contain a line of text which
mentions where the image came from. This would have a twofold
benefit: 1) it would encourage people to use the full resolution image
for reuse, 2) it would cause automated scraping processes which hit
our image pages to preserve human-readable tracking information.
Unlike a classic watermark, this addition could be removed via
cropping.  Technically this would require just a few more arguments to
ImageMagick during thumbnailing, but we'd have to make a few other
changes to handle smaller images and to treat the image page thumb
differently from other same-sized thumbs.
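
For the PNG idea in the list above, a rough sketch of what writing and
reading back such a comment could look like, using only the Python
standard library. It inserts a tEXt chunk right after IHDR and
recomputes the chunk CRCs. The choice of the "Comment" keyword, the
function names, and the example URL are assumptions for illustration,
not an agreed convention or anything our upload code does today.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def make_chunk(chunk_type, data):
    """Build one PNG chunk: length, type, data, CRC-32 over type + data."""
    crc = zlib.crc32(chunk_type + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + chunk_type + data + struct.pack(">I", crc)


def iter_chunks(png_bytes):
    """Yield (type, data) for each chunk in a PNG byte string."""
    if png_bytes[:8] != PNG_SIGNATURE:
        raise ValueError("not a PNG file")
    pos = 8
    while pos < len(png_bytes):
        (length,) = struct.unpack(">I", png_bytes[pos:pos + 4])
        chunk_type = png_bytes[pos + 4:pos + 8]
        yield chunk_type, png_bytes[pos + 8:pos + 8 + length]
        pos += 12 + length  # 4 length + 4 type + data + 4 CRC


def add_text(png_bytes, keyword, text):
    """Return a copy of the PNG with a tEXt chunk inserted right after IHDR."""
    # tEXt payload: Latin-1 keyword, a NUL separator, then Latin-1 text.
    payload = keyword.encode("latin-1") + b"\x00" + text.encode("latin-1")
    out = [PNG_SIGNATURE]
    for chunk_type, data in iter_chunks(png_bytes):
        out.append(make_chunk(chunk_type, data))
        if chunk_type == b"IHDR":
            out.append(make_chunk(b"tEXt", payload))
    return b"".join(out)


def read_text(png_bytes):
    """Return {keyword: text} for every tEXt chunk in the PNG."""
    found = {}
    for chunk_type, data in iter_chunks(png_bytes):
        if chunk_type == b"tEXt":
            keyword, _, text = data.partition(b"\x00")
            found[keyword.decode("latin-1")] = text.decode("latin-1")
    return found
```

A caller would do something like
add_text(data, "Comment", "https://commons.wikimedia.org/wiki/Image:Example.png")
at upload or thumbnailing time (the URL here is a made-up example), and
anyone holding the file later can recover it with read_text. Decoders
that don't care about text chunks should simply skip them.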

In general I think we need to think about how to push our metadata
into the image files themselves. Only if the metadata is embedded in
the images will downstream users have a hope of keeping track of the
images. If the rest of the world did this our lives would be much
easier, so let us do unto others as we would have others do unto us.
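
As one more illustration of pushing metadata into the file itself, the
SVG case could look roughly like the sketch below, which adds an RDF
block with a source URL and a license URL and reads the source back.
The element layout (cc:Work with dc:source and cc:license), the
Creative Commons namespace URI used here, and the function names are
assumptions for illustration rather than a settled schema.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC_NS = "http://purl.org/dc/elements/1.1/"
CC_NS = "http://creativecommons.org/ns#"  # assumption: which cc namespace to use

# Keep readable prefixes (and SVG as the default namespace) when serializing.
for prefix, uri in (("", SVG_NS), ("rdf", RDF_NS), ("dc", DC_NS), ("cc", CC_NS)):
    ET.register_namespace(prefix, uri)


def add_source_metadata(svg_text, source_url, license_url):
    """Return the SVG text with a <metadata> RDF block appended to the root."""
    root = ET.fromstring(svg_text)
    metadata = ET.SubElement(root, f"{{{SVG_NS}}}metadata")
    rdf = ET.SubElement(metadata, f"{{{RDF_NS}}}RDF")
    work = ET.SubElement(rdf, f"{{{CC_NS}}}Work", {f"{{{RDF_NS}}}about": source_url})
    ET.SubElement(work, f"{{{DC_NS}}}source", {f"{{{RDF_NS}}}resource": source_url})
    ET.SubElement(work, f"{{{CC_NS}}}license", {f"{{{RDF_NS}}}resource": license_url})
    return ET.tostring(root, encoding="unicode")


def read_source(svg_text):
    """Return the embedded source URL, or None if the SVG carries none."""
    node = ET.fromstring(svg_text).find(f".//{{{DC_NS}}}source")
    return None if node is None else node.get(f"{{{RDF_NS}}}resource")
```

A downstream user who only has the file can then call read_source on
it and get back to the image page, which is the whole point.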