Re: [Multimedia] [Commons-l] Hashing Wikimedia Commons - Multimedia

4 Sep 2014


Hi Jonas,
Awesome project!
I’m cc-ing the WMF Multimedia team, who might have some more answers :)
2014-09-04 12:26 GMT+02:00 Jonas Öberg <jonas@commonsmachinery.se>:
...
Dear all,
some of you may have been at our presentation during Wikimania and you'll
find this familiar, but for the rest of you, I'm working with Commons
Machinery on software that we hope will identify images on the web, even
when they are used outside of their original context, to provide automatic
attribution and a referral back to their origin. Imagine a blogger using a
photo from Commons: when you visit that blog, a browser plugin overlays
a small icon showing that the image is from Commons and inviting you to find
out more - even if the blogger forgot to attribute.
We're currently working on an addon for Firefox to do just this, and we've
previously worked out a backend to store the information we need to make
these matches, some utilities for perceptual image hashing etc. We would
love to work with images from Wikimedia Commons as a first dataset to
explore how this will all work in practice.
But in order to do so, we need information from Commons, and we want to
make this as easy on the WMF servers as possible, so we'd appreciate some
help and pointers. What we're looking at retrieving is information about
(1) title, (2) author, (3) license, and (4) thumbnails of medium size.
The first three we can get from pretty much any of the APIs, or extract
directly from a dump file. The thumbnails are eluding us though, for two
reasons. One is that a file, like 30C3_Commons_Machinery_2.jpg, is actually
in the /b/ba/ directory - but where this /b/ba/ comes from (a hash?) is
unclear to us now, and it's not something we find in the dumps - though we
can get it from one of the APIs.
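
As far as we can tell from how MediaWiki lays out its upload directories, the two path components come from the MD5 hash of the file name (with spaces replaced by underscores): the first hex digit, then the first two hex digits. A minimal sketch, assuming that rule and assuming the name is already in its canonical form; the function name hash_path is ours, not part of any API:

```python
import hashlib


def hash_path(filename: str) -> str:
    """Return the 'x/xy' directory prefix for a file name.

    Assumption: the prefix is the first one and first two hex digits of
    the MD5 of the UTF-8 encoded name, with spaces written as underscores.
    """
    name = filename.replace(" ", "_")
    digest = hashlib.md5(name.encode("utf-8")).hexdigest()
    return f"{digest[0]}/{digest[:2]}"


if __name__ == "__main__":
    # Prints the directory prefix computed for the example file.
    print(hash_path("30C3_Commons_Machinery_2.jpg"))
```

Since only the file name is needed, this could be computed locally from a dump without any request to the servers.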
The other is thumbnail sizes. We need to retrieve a reasonably sized image
(but in many cases less than the original size) of about 640px wide, so
that we can then run a perceptual hash algorithm on this file.
From what we can understand, you can request a thumbnail of any size for an
image simply by prefixing its file name with the width you want (like
123px-Filename.jpg). But it seems really silly to always request 640px, for
instance, since that would mean the WMF servers would need to generate that
size for us specifically if it doesn't already exist.
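
To make the URL scheme concrete, here is a rough sketch of how such a thumbnail address could be assembled. It assumes the layout served from upload.wikimedia.org (a thumb/ directory, the hashed path, the original name, then the width with a px suffix) and only covers formats such as JPEG where the thumbnail keeps the original extension; SVG and some other formats get an extra extension appended. The helper name thumb_url is hypothetical:

```python
import hashlib
from urllib.parse import quote

# Assumed base URL for Commons uploads.
BASE = "https://upload.wikimedia.org/wikipedia/commons"


def thumb_url(filename: str, width: int) -> str:
    """Build the URL of a thumbnail of the given pixel width."""
    name = filename.replace(" ", "_")
    digest = hashlib.md5(name.encode("utf-8")).hexdigest()
    encoded = quote(name)  # percent-encode non-ASCII and reserved characters
    return (
        f"{BASE}/thumb/{digest[0]}/{digest[:2]}/"
        f"{encoded}/{width}px-{encoded}"
    )


if __name__ == "__main__":
    print(thumb_url("30C3_Commons_Machinery_2.jpg", 640))
```

Note that this only builds the address; it says nothing about whether that size has already been rendered on the server side.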
What we'd find much more appealing is to be able to determine before
making the call what sizes already exist and which can be retrieved without
the WMF servers needing to rescale them for us. And while the viewer on
Commons does seem to offer thumbnails in various sizes, we can't seem to get
that information from any API.
We can scrape the Commons web page for this information, but we figured
that people here might have good ideas for how we approach this with
minimal impact on the WMF servers :)
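
One option we are aware of is the imageinfo module of the MediaWiki API, which accepts an iiurlwidth parameter and returns a thumbnail URL for that width, and which can take several titles in one request. The sketch below only builds such a request URL; it does not tell us which sizes are already rendered, and the choice of iiprop values (including extmetadata for author and license) and the function name imageinfo_url are our assumptions:

```python
from urllib.parse import urlencode

API = "https://commons.wikimedia.org/w/api.php"


def imageinfo_url(titles, width=640):
    """Build one API request URL covering several file titles at once."""
    params = {
        "action": "query",
        "format": "json",
        "prop": "imageinfo",
        # url: original and thumbnail URLs; size: original dimensions;
        # extmetadata: author and license fields (assumed to be available).
        "iiprop": "url|size|extmetadata",
        "iiurlwidth": width,
        "titles": "|".join(titles),
    }
    return f"{API}?{urlencode(params)}"


if __name__ == "__main__":
    print(imageinfo_url(["File:30C3_Commons_Machinery_2.jpg"]))
```

Batching titles this way (we believe up to 50 per request for ordinary clients) should keep the number of API calls down, though the thumbnails themselves would still need to be fetched one by one.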
Sincerely,
Jonas

Commons-l mailing list
Commons-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/commons-l
-- 
Jean-Frédéric