Someone came to me with an image they believed they had obtained from Commons, but they were unsure of exactly where they found it...
I was eventually able to locate the image, but it took a lot of work.
Had the image been deleted for copyright problems, it would have been nearly impossible, yet that is exactly the sort of situation in which we may most need the ability to find the image.
I have a local database of image fingerprints (quantized color histograms) which could be used to locate images... it didn't work in this case because the image was newer than our last image backup (which was last year). It might be useful, but it's not a complete solution.
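For readers unfamiliar with the idea, here is a minimal sketch of a quantized color histogram fingerprint. It is not the database described above; the function names, the four-levels-per-channel quantization, and the L1 distance are all assumptions chosen for illustration, and pixels are passed in as plain (r, g, b) tuples so that no image library is needed.

```python
from collections import Counter

def quantized_histogram(pixels, levels=4):
    """Return a normalized histogram over levels**3 coarse RGB buckets.

    `pixels` is an iterable of (r, g, b) tuples with channels in 0..255.
    """
    counts = Counter()
    for r, g, b in pixels:
        # Map each 0..255 channel value to a bucket index in 0..levels-1.
        bucket = (r * levels // 256, g * levels // 256, b * levels // 256)
        counts[bucket] += 1
    total = sum(counts.values())
    return {bucket: n / total for bucket, n in counts.items()}

def histogram_distance(h1, h2):
    """L1 distance between two histograms: 0.0 if identical, 2.0 if disjoint."""
    buckets = set(h1) | set(h2)
    return sum(abs(h1.get(b, 0.0) - h2.get(b, 0.0)) for b in buckets)

# Three toy "images": an original, a slightly recompressed copy, and an
# unrelated picture.
original = [(200, 30, 30)] * 3 + [(10, 10, 240)]
recompressed = [(205, 35, 28)] * 3 + [(12, 8, 236)]
unrelated = [(10, 240, 10)] * 4

print(histogram_distance(quantized_histogram(original),
                         quantized_histogram(recompressed)))
print(histogram_distance(quantized_histogram(original),
                         quantized_histogram(unrelated)))
```

Because small color shifts usually land in the same coarse bucket, a fingerprint of this kind can tolerate recompression, which an exact byte comparison cannot.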
I wanted to get input from the community on a couple of actions we might like to take to improve the situation in the future:
# On upload we could attach the URL under which the image was uploaded to the image itself, in an EXIF tag. There are a great many EXIF tags defined and I'm sure we could find a fitting one. This would only work for .JPG but it would be easy to implement. As a separate topic, we should consider adding license data to our EXIF tags (I do it for my images, but we should perhaps do it more generally). This would be fairly easy to do and I don't think it would be controversial, although there would be some complexity with respect to image moves once we gain that ability in the future. Does anyone object to this?
# We could also add the same in PNG comments... although such use of PNG comments is non-standard, I don't think it would break anything. Does anyone have any thoughts on that?
# We could add some RDF tags to SVGs for the same purpose, although I think the PNG rasterizations of SVGs would be more important.
# Finally, something that might be somewhat controversial: I think it would be a good idea to add some text to the (raster) thumbnail image on the image page. My idea is that we would add an extra white area below the image, large enough to contain a line of text which mentions where the image came from. This would have a twofold benefit: 1) it would encourage people to use the full-resolution image for reuse, 2) it would cause automated scraping processes which hit our image pages to preserve human-readable tracking information. Unlike a classic watermark, this addition could be removed via cropping. Technically this would require just a few more arguments to ImageMagick during thumbnailing, but we'd have to make a few other changes to handle smaller images and to treat the image page thumb differently from other same-sized thumbs.
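As a rough illustration of the "few more arguments" idea in the last item, the sketch below builds an ImageMagick `convert` command line that resizes an image, splices a white bar onto the bottom edge, and writes a caption into that bar. The function name, the bar height, the point size, and the caption text are assumptions for illustration; this is not the thumbnailing command the site actually uses, and the handling of very small images mentioned above is left out.

```python
def caption_thumbnail_command(src, dst, width, caption, bar_height=18):
    """Build (but do not run) a `convert` command that adds a caption bar.

    All parameter names and defaults here are illustrative assumptions.
    """
    return [
        "convert", src,
        "-resize", f"{width}x",          # scale to the requested width
        "-gravity", "south",             # operate on the bottom edge
        "-background", "white",          # color of the spliced-in bar
        "-splice", f"0x{bar_height}",    # add bar_height rows of background
        "-pointsize", "11",
        "-annotate", "+0+2", caption,    # draw the text inside the new bar
        dst,
    ]

cmd = caption_thumbnail_command(
    "Example.jpg", "Example_thumb.jpg", 800,
    "From Wikimedia Commons: Image:Example.jpg",  # hypothetical caption
)
print(" ".join(cmd))
```

Passing the list to `subprocess.run(cmd, check=True)` would execute it on a machine with ImageMagick installed. Since the bar is appended rather than overlaid, cropping it off restores the plain thumbnail.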
In general I think we need to think about how to push our metadata into the image files themselves. Only if the metadata is embedded in the images will downstream users have a hope of keeping track of the images. If the rest of the world did this our lives would be much easier, so let us do unto others as we would have others do unto us.
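To make the PNG variant of this concrete, here is a minimal sketch that inserts a `tEXt` chunk carrying a source URL directly after the header chunk of a PNG file, using only the Python standard library. The helper names, the choice of the `Comment` keyword, and the example URL are assumptions for illustration, not an agreed convention.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"

def make_chunk(chunk_type: bytes, data: bytes) -> bytes:
    """Serialize one PNG chunk: length, type, data, CRC over type+data."""
    crc = zlib.crc32(chunk_type + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + chunk_type + data + struct.pack(">I", crc)

def add_text_chunk(png: bytes, keyword: str, text: str) -> bytes:
    """Return a copy of `png` with a tEXt chunk inserted after IHDR."""
    if not png.startswith(PNG_SIGNATURE):
        raise ValueError("not a PNG file")
    # IHDR is always the first chunk; its end is signature + 12 + data length.
    (ihdr_len,) = struct.unpack(">I", png[8:12])
    ihdr_end = 8 + 12 + ihdr_len
    payload = keyword.encode("latin-1") + b"\x00" + text.encode("latin-1")
    return png[:ihdr_end] + make_chunk(b"tEXt", payload) + png[ihdr_end:]

def read_text_chunks(png: bytes) -> dict:
    """Collect keyword/text pairs from every tEXt chunk in the file."""
    found = {}
    pos = len(PNG_SIGNATURE)
    while pos < len(png):
        (length,) = struct.unpack(">I", png[pos:pos + 4])
        chunk_type = png[pos + 4:pos + 8]
        data = png[pos + 8:pos + 8 + length]
        if chunk_type == b"tEXt":
            key, _, value = data.partition(b"\x00")
            found[key.decode("latin-1")] = value.decode("latin-1")
        pos += 12 + length
    return found

# Build a tiny 1x1 grayscale PNG in memory so the example needs no files.
tiny_png = (
    PNG_SIGNATURE
    + make_chunk(b"IHDR", struct.pack(">IIBBBBB", 1, 1, 8, 0, 0, 0, 0))
    + make_chunk(b"IDAT", zlib.compress(b"\x00\x00"))
    + make_chunk(b"IEND", b"")
)

# Hypothetical source URL, used only for this demonstration.
SOURCE_URL = "https://commons.wikimedia.org/wiki/Image:Example.png"
tagged_png = add_text_chunk(tiny_png, "Comment", SOURCE_URL)
print(read_text_chunks(tagged_png))
```

Decoders skip ancillary chunks they do not understand, so a text chunk like this should leave the rendered image unchanged. It only survives, however, if downstream tools copy the file without re-encoding it.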
On 12/5/06, Gregory Maxwell gmaxwell@gmail.com wrote:
Someone came to me with an image they believed they had obtained from Commons, but they were unsure of exactly where they found it...
I was eventually able to locate the image, but it took a lot of work.
Had the image been deleted for copyright problems, it would have been nearly impossible, yet that is exactly the sort of situation in which we may most need the ability to find the image.
I have a local database of image fingerprints (quantized color histograms) which could be used to locate images... it didn't work in this case because the image was newer than our last image backup (which was last year). It might be useful, but it's not a complete solution.
A few hours ago, I wrote about adding an MD5 hash (or the like) to each image entry in the database. That would have helped finding the image in question as well, except if it has been altered.
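A minimal sketch of that exact-match idea, assuming a simple in-memory index keyed by MD5 digest (the function names and file names below are hypothetical, and a real implementation would store the digest in the image table instead):

```python
import hashlib

def image_digest(data: bytes) -> str:
    """Hex MD5 digest of the raw file bytes."""
    return hashlib.md5(data).hexdigest()

def build_index(files: dict) -> dict:
    """Map digest -> file name for a {name: bytes} collection."""
    return {image_digest(data): name for name, data in files.items()}

# Stand-in byte strings instead of real image files.
uploads = {
    "Sunset.jpg": b"\xff\xd8 fake jpeg bytes A",
    "Harbor.jpg": b"\xff\xd8 fake jpeg bytes B",
}
index = build_index(uploads)

exact_copy = b"\xff\xd8 fake jpeg bytes A"
altered_copy = b"\xff\xd8 fake jpeg bytes A."  # one extra byte

print(index.get(image_digest(exact_copy)))    # found
print(index.get(image_digest(altered_copy)))  # not found
```

The second lookup fails because any change to the bytes, including rescaling or recompression, produces an unrelated digest. That is the limitation noted above.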
I wanted to get input from the community on a couple of actions we might like to take to improve the situation in the future:
# On upload we could attach the URL under which the image was uploaded to the image itself, in an EXIF tag. There are a great many EXIF tags defined and I'm sure we could find a fitting one. This would only work for .JPG but it would be easy to implement.
That could be done as part of the upload process. If we eventually enable copy-from-web (again, some code of mine deactivated for unknown reasons; next time, I'll set these things to "on" by default, so the gods in charge can't ignore it forever like they do now) we could also include the "original" (pre-commons) URL.
As a separate topic, we should consider adding license data to our EXIF tags (I do it for my images, but we should perhaps do it more generally). This would be fairly easy to do and I don't think it would be controversial, although there would be some complexity with respect to image moves once we gain that ability in the future. Does anyone object to this?
# We could also add the same in PNG comments... although such use of PNG comments is non-standard, I don't think it would break anything. Does anyone have any thoughts on that?
# We could add some RDF tags to SVGs for the same purpose, although I think the PNG rasterizations of SVGs would be more important.
Adding licenses to the image will require changing the image file on any license-altering edit to the description page. It also means we need to parse said description for license tags. Unless, of course, we limit this function to the license set on the upload page.
That said, I think either is a good idea.
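The description-parsing step could look roughly like the sketch below, which scans wikitext for template calls and keeps those whose names appear in a list of license templates. The set of template names, the regular expression, and the function name are assumptions for illustration; nested templates and redirects between template names are not handled.

```python
import re

# Illustrative subset only; a real list would have to be maintained somewhere.
KNOWN_LICENSE_TEMPLATES = {"gfdl", "pd-self", "cc-by-2.5", "cc-by-sa-2.5"}

# Matches {{Name}} or {{Name|args}} as long as there are no nested braces.
TEMPLATE_RE = re.compile(r"\{\{\s*([^{}|]+?)\s*(?:\|[^{}]*)?\}\}")

def find_license_tags(wikitext: str) -> list:
    """Return known license template names in order of first appearance."""
    found = []
    for match in TEMPLATE_RE.finditer(wikitext):
        name = match.group(1).strip().lower()
        if name in KNOWN_LICENSE_TEMPLATES and name not in found:
            found.append(name)
    return found

description = (
    "{{Information|Description=A harbor at dusk}}\n"
    "== Licensing ==\n"
    "{{GFDL}}\n"
    "{{cc-by-sa-2.5|Some Author}}\n"
)
print(find_license_tags(description))
```

Running a check like this on every edit to the description page would show when the embedded license data has gone stale and the file needs rewriting.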
# Finally, something that might be somewhat controversial: I think it would be a good idea to add some text to the (raster) thumbnail image on the image page. My idea is that we would add an extra white area below the image, large enough to contain a line of text which mentions where the image came from. This would have a twofold benefit: 1) it would encourage people to use the full-resolution image for reuse, 2) it would cause automated scraping processes which hit our image pages to preserve human-readable tracking information. Unlike a classic watermark, this addition could be removed via cropping. Technically this would require just a few more arguments to ImageMagick during thumbnailing, but we'd have to make a few other changes to handle smaller images and to treat the image page thumb differently from other same-sized thumbs.
I'm not sure it's worth the effort. 1) We already link to the high-res version in the line below the image. Altering the thumbnail requires people to edit the image if they don't want the high-res version (maybe they're on a modem?) 2) That would be useful for automated image-scrapers that don't use the page as well and don't link back to the commons. Do you have an example for this? Also, IMHO such a bar would uglify (is that a word?) most images. And, our JPG thumbnails are JPGs as well; depending on the compression, JPGs don't render (small) text very well.
In general I think we need to think about how to push our metadata into the image files themselves. Only if the metadata is embedded in the images will downstream users have a hope of keeping track of the images. If the rest of the world did this our lives would be much easier, so let us do unto others as we would have others do unto us.
Agreed. But I'd also like for us to use existing data more within the system. We already use EXIF data to categorize camera models, IIRC? The images themselves contain data (color etc.); how about "similar images"? (yes, I know that's a big one, just dreaming here;-)
I created a new Flickr account a few days ago, and I very much like the "feel" of it. The whole site screams that it's designed for images. Maybe we should think about tag/category clouds, pre-linking various image sizes, and integrating mass-organization (like "show me my images, select this and that, tag them with category XYZ"). I'm not saying we should become Flickr, but we should learn from them.
Magnus
Magnus Manske wrote:
A few hours ago, I wrote about adding an MD5 hash (or the like) to each image entry in the database. That would have helped finding the image in question as well, except if it has been altered.
A Google search for "image hash" gives 878 results (curses, [[Image:Hash function.svg]] shows up several times); surely there's a good one with a free implementation /somewhere/?
On 12/5/06, Alphax (Wikipedia email) alphasigmax@gmail.com wrote:
A Google search for "image hash" gives 878 results (curses, [[Image:Hash function.svg]] shows up several times); surely there's a good one with a free implementation /somewhere/?
The subject you really want to search is "image indexing" and it's a surprisingly immature area.
A simple exact-match hash will be kept by the later image storage system. It would be useful for some things (detecting exact digital duplicates), but it wouldn't have helped with my image (which was a copy of the image page thumbnail).