Hi Jonas,

Awesome project!

I’m cc-ing the WMF Multimedia team, who might have some more answers :)


2014-09-04 12:26 GMT+02:00 Jonas Öberg <jonas@commonsmachinery.se>:
Dear all,

some of you may have been at our presentation during Wikimania and you'll find this familiar, but for the rest of you: I'm working with Commons Machinery on software that aims to identify images on the web, even when they are used outside of their original context, and to provide automatic attribution and a link back to their origin. Imagine a blogger using a photo from Commons: when you visit that blog, a browser plugin overlays a small icon showing that the image is from Commons and inviting you to find out more - even if the blogger forgot to attribute it.

We're currently working on an add-on for Firefox to do just this, and we have already built a backend to store the information we need to make these matches, along with some utilities for perceptual image hashing. We would love to work with images from Wikimedia Commons as a first dataset to explore how this will all work in practice.

But in order to do so, we need information from Commons, and we want to make this as easy on the WMF servers as possible, so we'd appreciate some help and pointers. What we're looking at retrieving is information about (1) title, (2) author, (3) license, and (4) thumbnails of medium size.

The first three we can get from pretty much any of the APIs, or extract directly from a dump file. The last one is eluding us though, for two reasons. One is that a file, like 30C3_Commons_Machinery_2.jpg, is actually in the /b/ba/ directory - but where this /b/ba/ comes from (a hash?) is unclear to us right now, and it's not something we find in the dumps - though we can get it from one of the APIs.
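As a minimal sketch of where such a prefix can come from, assuming the default MediaWiki hashed upload layout: the two directory levels are the first one and the first two hex digits of the MD5 of the file name in its underscore form. The function name hashed_dir is made up for illustration, and title normalisation beyond replacing spaces is left out.

```python
import hashlib


def hashed_dir(filename: str) -> str:
    """Return the two-level directory prefix for a file name.

    Assumption: the default MediaWiki hashed upload layout, i.e. the
    first hex digit and the first two hex digits of the MD5 of the
    underscore form of the name. hashed_dir is a hypothetical helper.
    """
    name = filename.replace(" ", "_")  # file names are stored with underscores
    digest = hashlib.md5(name.encode("utf-8")).hexdigest()
    return f"{digest[0]}/{digest[:2]}"


if __name__ == "__main__":
    print(hashed_dir("30C3_Commons_Machinery_2.jpg"))
```

If that assumption holds, the prefix can be computed locally from the file names in a dump, without any request to the servers.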

The other is thumbnail sizes. We need to retrieve a reasonably sized image (in many cases smaller than the original), about 640px wide, so that we can then run a perceptual hash algorithm on it.

From what we can understand, you can request a thumbnail of any size simply by prefixing the file name with the width you want (like 123px-Filename.jpg). But it seems really silly to always request 640px, for instance, since that would mean the WMF servers would need to generate that size for us specifically if it doesn't already exist.
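A rough sketch of how such a thumbnail URL could be put together, assuming the `<width>px-` prefix form and the /thumb/ path layout seen in Commons thumbnail URLs, plus an MD5-based directory prefix. BASE and thumb_url are illustrative names, and file types that are rendered to a different format (SVG, for example, gets an extra .png suffix) are not handled.

```python
import hashlib
from urllib.parse import quote

# Assumed base of the Commons upload host.
BASE = "https://upload.wikimedia.org/wikipedia/commons"


def thumb_url(filename: str, width: int = 640) -> str:
    """Build a thumbnail URL for a raster file (hypothetical helper).

    Assumptions: /thumb/<h>/<hh>/<name>/<width>px-<name> layout, with
    <h>/<hh> taken from the MD5 of the underscore form of the name.
    """
    name = filename.replace(" ", "_")
    digest = hashlib.md5(name.encode("utf-8")).hexdigest()
    encoded = quote(name)  # percent-encode non-ASCII characters
    return f"{BASE}/thumb/{digest[0]}/{digest[:2]}/{encoded}/{width}px-{encoded}"


if __name__ == "__main__":
    print(thumb_url("30C3_Commons_Machinery_2.jpg"))
```

Building the URL locally says nothing about whether that size has already been rendered, which is the open question here.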

What we'd find much more appealing is to be able to determine, before making the call, which sizes already exist and can be retrieved without the WMF servers needing to rescale them for us. And while the viewer on Commons does seem to offer thumbnails in various sizes, we can't seem to get that information from any API.
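For reference, a small sketch of the closest API call we are aware of: the imageinfo query with the iiurlwidth parameter, which returns a thumbnail URL and its dimensions for the requested width. It does not report which sizes are already rendered. The helper name imageinfo_query_url is made up here, and the sketch only builds the request URL without sending anything.

```python
from urllib.parse import urlencode

API = "https://commons.wikimedia.org/w/api.php"


def imageinfo_query_url(filenames, width=640):
    """Build an imageinfo query URL for a batch of files.

    imageinfo_query_url is a hypothetical helper. The response to this
    query carries thumburl, thumbwidth and thumbheight for each file.
    """
    params = {
        "action": "query",
        "format": "json",
        "prop": "imageinfo",
        "iiprop": "url|size",
        "iiurlwidth": width,
        # Several titles can be batched into one request.
        "titles": "|".join(f"File:{name}" for name in filenames),
    }
    return f"{API}?{urlencode(params)}"


if __name__ == "__main__":
    print(imageinfo_query_url(["30C3_Commons_Machinery_2.jpg"]))
```

Batching many titles into one request at least keeps the number of API calls down, even if the thumbnails themselves still have to be fetched one by one.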

We can scrape the Commons web page for this information, but we figured that people here might have good ideas for how we approach this with minimal impact on the WMF servers :)

Sincerely,
Jonas


_______________________________________________
Commons-l mailing list
Commons-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/commons-l




--
Jean-Frédéric